Gemma 4 12B: A unified, encoder-free multimodal model

blog.google

981 points by rvz a day ago


senko - a day ago

I ran the Q4 quant (used with llama.cpp) through my "minesweeper" vibe-coding benchmark: https://senko.net/vibecode-bench/2026/minesweeper-gamma-4-12...

The result is decent, but it had a few bizarre/trivial syntax errors I had to fix manually: it would add an extra closing bracket or paren a few times, and wanted to separate function definitions with a comma. Not sure what that was about, but otherwise the output ran just fine.

So, with those qualifiers, I think it's a decent local coding model. It roughly compares with GPT-4.1 (!!), released 14 months ago, on the output: https://senko.net/vibecode-bench/2025/minesweeper-gpt-4.1.ht... (actually I'd call it better, but those syntax errors...)

I ran the quantized version (4-bit GGUF) on my consumer-grade card with 12G of VRAM and got 5t/s for output. Too slow for interactive coding use, but a fairly capable model.

To me, it's fascinating how much progress we got in over a year. GPT-4.1 was considered an extremely capable coding model. Now we got something with 12B of params performing roughly the same (in this specific benchmark, disclaimers, etc).

Lists of various models I tested: https://senko.net/vibecode-bench/

minimaxir - a day ago

The big story here is the encoder-free part, which I still don't fully understand.

> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.

That's technically encoding, just without using a dedicated model for it like SigLIP? The Developer's Guide elaborates: it's still a 35M layer, and I am curious whether it's robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...
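
To picture what "a single matrix multiplication, positional embedding and normalizations" could look like for image patches, here is a minimal sketch. It is not Google's implementation: the patch size, hidden width, RMS-style normalization and all names (`rms_norm`, `embed_patches`) are assumptions for illustration, and the weights are random.

```python
import math
import random

def rms_norm(vec, eps=1e-6):
    """Scale a vector to (approximately) unit root-mean-square."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def embed_patches(patches, weight, pos_emb):
    """Project flattened patches to the model width with one matmul,
    add a per-position embedding, then normalize."""
    tokens = []
    for i, patch in enumerate(patches):
        x = rms_norm(patch)                                         # normalize raw pixels
        h = [sum(p * w for p, w in zip(x, row)) for row in weight]  # the single matmul
        h = [a + b for a, b in zip(h, pos_emb[i])]                  # positional embedding
        tokens.append(rms_norm(h))                                  # normalize the output
    return tokens

# Toy sizes (assumed): 4 patches of 2x2 RGB pixels -> 12 values each, width 8.
random.seed(0)
patch_dim, hidden, n_patches = 12, 8, 4
weight = [[random.gauss(0, 0.02) for _ in range(patch_dim)] for _ in range(hidden)]
pos_emb = [[random.gauss(0, 0.02) for _ in range(hidden)] for _ in range(n_patches)]
patches = [[random.random() for _ in range(patch_dim)] for _ in range(n_patches)]

tokens = embed_patches(patches, weight, pos_emb)
print(len(tokens), len(tokens[0]))
```

In this reading, the patch tokens get no attention layers of their own; any mixing between patches would have to happen inside the LLM layers themselves.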

> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.

I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.

asim - a day ago

We are now entering the closed loop game. Google doesn't need anyone else to accelerate their models. This is their bread and butter.

I'm both shocked but also not surprised that they continue to develop such efficiencies. Honestly it's like silicon and CPU architecture advancement. We kept shrinking it and shrinking it and it kept getting more and more powerful and here we are with AI and it's only going to be 100x more efficient with time. Maybe there's some point of decay but essentially the next 30 years will be more advanced than the last 30 and we're going to be living in some sort of futurist Blade Runner scenario where gene editing is repairing ageing cells, organs and curing all sorts of cancers that haven't even appeared yet. Beyond our lifetimes people will live to 125 quite steadily and with great mobility and then obviously people will look to how do we get to living 1000 years, which, if anyone is religious, they know Noah and others lived to in a totally different era.

Anyway I'm going off on some tangent but look back 30 years. Now look forward 30 years. It's going to be insane. May God protect us.

ethanpil - a day ago

What's Google's business case for releasing open models? Don't get me wrong, I am grateful and appreciative of these releases. I'm trying to understand how it fits into their bigger picture as a for profit company? Are they not helping competitors build on the novel technology they have developed?

Is it simply goodwill and/or marketing? Or am I missing something strategic?

petercooper - a day ago

Its image processing is terrible. I ran several tests of it against Qwen 3.5 0.8b (yes, 7% the size) and Qwen beat it every time, with Gemma often getting things entirely wrong. I even gave it a plain image saying "This is a test" and it thought for 6 minutes trying to analyze it and failed. Qwen 3.5 0.8b confidently got it in under a second.

It may be that the Q6 quant I got is borked (or my LM Studio is), but either way, the 0.8b's performance is mind boggling in comparison.

accrual - 4 hours ago

Splendid model, it reminds me of Gemma3 27B which was my favorite local model last year. Gemma always had a bit more warmth/empathy compared to Qwen and Mistral in my experience and I found it more useful for personal questions.

My system has a 4080 Super (16GB) installed and using llama.cpp (b9333-35c9b1f39) I got these results on a test prompt:

* Qwen3.5-9B-Q6_K.gguf - Prompt: 1492.0 t/s | Generation: 81.0 t/s

* gemma-4-12b-it-Q4_K_M.gguf - Prompt: 1329.2 t/s | Generation: 72.3 t/s

* gemma-4-12b-it-Q8_0.gguf - Prompt: 504.4 t/s | Generation: 25.2 t/s

ComputerGuru - a day ago

Quite aside from the architectural changes, I suppose this is the answer to why Google had such a glaring hole in the (pretrained) Gemma4 model lineup between the Gemma4 4b and Gemma4 26b models!

A model that comfortably fits in 16GB of VRAM (allowing room for context) is a welcome upgrade.

djyde - a day ago

What are the use cases for these small models? Is there anyone using models of this scale in their daily life who could share their experience?

nickandbro - a day ago

Wow, Google is becoming the new pre-Llama 4 Meta when it comes to releasing open-weights models.

outageroom - 7 hours ago

I really like the idea of small models that you can get the most out of. If I weren't a programmer, I wouldn't even know what I would use Opus 4.8 or GPT 5.5 models for.

dwa3592 - a day ago

This is a pretty good update. The demo video is a bit funny though: the tester asks to turn the release into bullet points. Okay, the model obliges. Then the tester says to draft an email with this content. BAM! The LLM turns the content from bullets to passages even though it was not asked to, undoing the last good thing that it did. I am not sure if it's email etiquette to not put bullets in an email.

briansm - 7 hours ago

Strange that they are feeding raw audio in. Even in humans, there is a hardware transform to the frequency domain (the cochlea) before data is fed to the brain. Effectively doing this part in the LLM seems inefficient.
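
To make "transform to the frequency domain" concrete, here is a minimal sketch of a naive discrete Fourier transform over one audio frame, which is roughly the kind of step a spectrogram front end performs before a model sees the data. The sample rate, frame length and test tone are assumptions chosen for illustration; nothing here reflects how Gemma actually handles audio.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Naive O(n^2) DFT; returns magnitudes for bins 0..n/2."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

sample_rate = 8000   # Hz (assumed)
frame_len = 64       # samples per frame (assumed)
tone_hz = 1000       # test tone (assumed)

# One frame of a pure sine wave as the "raw audio".
frame = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(frame_len)]

mags = dft_magnitudes(frame)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * sample_rate / frame_len
print(peak_bin, peak_hz)
```

The 64 raw samples collapse to a single dominant bin at the tone's frequency, which is the compact representation a fixed front end hands over for free and a model fed raw samples has to learn.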

julianlam - a day ago

Last time I tried Gemma 4 (26B-A4B) its memory usage would balloon and consume all of my swap until my machine died.

Qwen 3.6 on the other hand barely uses any memory at all for its KV cache.
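
For reference, KV cache size grows linearly with context length under plain full attention, which is a common reason memory balloons on long prompts. A rough sketch of the standard estimate follows; every config value in it is a made-up placeholder, not the real Gemma or Qwen configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Keys and values (the factor 2) are stored per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Hypothetical config, for illustration only.
cfg = dict(n_layers=40, n_kv_heads=8, head_dim=128)

for context_len in (8_192, 131_072):
    gib = kv_cache_bytes(context_len=context_len, **cfg) / 2**30
    print(f"{context_len:>7} tokens: {gib:.2f} GiB")
```

Architectures that use sliding-window or otherwise restricted attention in some layers cache fewer tokens per layer, so treat this as an upper bound for the full-attention case.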

baalimago - 7 hours ago

I don't understand why Google does this. If I can run this locally, why would I need a subscription or use any inference provider, including Google..?

Scorched earth tactics to make anthropic and openai IPO fail?

kristianp - 20 hours ago

What quantisation do the creators intend this to be run at? They talk about 16GB of RAM, so should it be run at 8 bit? People here are talking about using Q4, but I would have thought a smaller model like this wouldn't perform well at such low bits per parameter. Edit: it looks like their benchmarks would have been done at 16-bit float, as the Hugging Face release is that size: https://huggingface.co/google/gemma-4-12B . Which is a little deceptive: they're advertising that an 8-bit size will fit on 16GB laptops, while releasing a 16-bit size.
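
The arithmetic behind that is simple enough to sketch. This counts the weights only, treats "12B" as exactly 12 billion parameters (an assumption; the real count may differ), and ignores KV cache and runtime overhead.

```python
def weight_gb(n_params, bits_per_param):
    """Memory for the weights alone, in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 12e9  # nominal 12B (assumed)

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_gb(n_params, bits):.0f} GB")
```

So 16-bit weights alone exceed 16GB, 8-bit leaves only a few GB for everything else, and 4-bit leaves comfortable headroom. Real GGUF quant types such as Q4_K_M average somewhat more than their nominal bit width, so actual files come out larger than this estimate.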

I guess we have to wait for someone to produce perplexity curves at different Q's.

dgacmu - 18 hours ago

I was excited about this until I fed it one of my local test problems: coin identification. I then spent 10 minutes arguing with it that a photo of a 1998 Washington quarter was not, in fact, a Morgan Silver Dollar. I mean, I wish it was.

It went into a crash loop on a British Columbia 1 dollar coin. This happened with both Q4_1 and Q8. Maybe I'm holding it wrong or it's just really bad for this task.

In contrast, gemma4 gets the british columbia coin right though it also mis-identifies the quarter. gemini 3.1-flash-lite nails them both.

Was getting about 50 t/s output on a 3090 with Q8 which seems ok.

scirob - a day ago

Quickly deployed it to check some benchmarks relevant for the German language. These are results for CohereLabs/include-base-44, German only: Gemma 4 12B 61.9%

  Gemma 4 26B (a4b MoE)    0.647
  Qwen 3 14B               0.621
  Gemma 4 12B              0.618
  Ministral 14B 2512       0.604
  Gemma 3 12B              0.547

The Qwen 3 14B vs Gemma 4 12B difference is within random variance; in some repeat runs they actually got the exact same score. The next step up, Gemma 4 31B, gets 0.676 on this. With reasoning enabled, Qwen 3 14B (reasoning) also gets 0.676.
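
As a rough sanity check on the "within random variance" claim: if a benchmark has n independent questions, the standard error of an accuracy p is sqrt(p(1-p)/n). The sketch below uses the two scores from the table; `n_questions` is a placeholder assumption, not the real size of the German subset, and treating the two runs as independent is a simplification since both models answer the same questions.

```python
import math

def accuracy_std_error(p, n):
    """Binomial standard error of an accuracy p measured on n questions."""
    return math.sqrt(p * (1 - p) / n)

n_questions = 500            # placeholder assumption
qwen, gemma = 0.621, 0.618   # scores from the table above

# Standard error of the difference between two independent accuracies.
se_diff = math.sqrt(accuracy_std_error(qwen, n_questions) ** 2
                    + accuracy_std_error(gemma, n_questions) ** 2)
print(f"difference = {qwen - gemma:.3f}, standard error = {se_diff:.3f}")
```

Under that assumed n, a 0.003 gap is roughly a tenth of one standard error, so it says nothing about which model is better.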

I'll run some cheat-proof benchmarks tomorrow to see if Qwen is still on top.

lxgr - a day ago

Am I missing something or are the Ollama versions of this (https://ollama.com/library/gemma4/tags) text-only for now?

__natty__ - a day ago

It’s fascinating for me to see how much small language models have recently grown in capabilities while still being consumer-friendly enough in size to run on people's own machines.

Zambyte - a day ago

Is this Mac only? Or is that an Ollama issue that it only supports this release of models on Mac? It seems like every tag with the MLX badge is only supported on Mac[0], and that includes all of the tags in this release.

[0] https://ollama.com/library/gemma4/tags

Edit: MLX being Mac-only is independent of the model being MLX (and therefore Mac) only. The latter is what I am asking about.

benbojangles - 12 hours ago

I run gemma-4-26b-bf16 in mtp mode and it runs very smooth, spitting out answers in seconds and outputting text 30x faster than i can read.

wuyunhuo - 15 hours ago

The optimal small-model solution, delivering multimodal, reasoning, and coding experiences on affordable hardware that were remarkably close to those of mid-to-large models at the time.

thomasjb - a day ago

Unfortunately there are no GGUF quants of the assistant model yet: https://huggingface.co/models?other=base_model:quantized:goo...

christina97 - a day ago

It seems worse in all aspects than the 26B A4B? I would have thought dense models beat MoE still on many benchmarks?

Is the entire point of this model then that it runs if you don’t have enough GPU memory to load the 26B? That one runs faster anyway due to lower active params.

jamwise - 20 hours ago

"Small enough to run locally with just 16GB of VRAM or unified memory"

With many laptops dropping back down to 8GB because of the memory shortage, there are some interesting pressures building in the industry.

RandyOrion - a day ago

A small dense multimodal model with audio support, interesting.

Wait, *Excluding Chinese language.

This is ... curious.

P.S. Where is gemma 4 124b?

randomNumber7 - a day ago

> Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.

I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog)

anonova - a day ago

Do Gemma 4 models compete with Gemini 3.1 Flash-Lite? I would assume even the smallest Gemini model would outperform even Gemma 4 31B, but I can't really get a sense of performance or output quality difference.

Havoc - a day ago

Quite a niche release. The MoE outperforms it on score and will likely be faster thanks to lower active weights. So this really only makes sense for specific ram constrained applications that can’t fit a quantized MoE
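
A back-of-the-envelope way to see the speed argument: single-stream decoding is often limited by memory bandwidth, so tokens per second is roughly bandwidth divided by the bytes of weights read per token. The bandwidth figure and bit width below are assumptions, and the estimate ignores KV cache reads, expert routing overhead and compute limits.

```python
def est_tokens_per_sec(active_params, bits_per_param, bandwidth_gb_s):
    """Upper-bound estimate: every active weight is read once per generated token."""
    bytes_per_token = active_params * bits_per_param / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

bandwidth_gb_s = 400  # assumed memory bandwidth in GB/s
bits = 4              # assumed quantization

dense_12b = est_tokens_per_sec(12e9, bits, bandwidth_gb_s)
moe_a4b = est_tokens_per_sec(4e9, bits, bandwidth_gb_s)  # ~4B active of 26B total
print(f"dense 12B: {dense_12b:.0f} t/s, 26B-A4B: {moe_a4b:.0f} t/s")
```

The ratio is just 12B over 4B active parameters, about 3x in the MoE's favor, but the MoE still needs all 26B weights resident in memory, which is exactly the constraint that leaves room for the dense 12B.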

benbojangles - 11 hours ago

why combine audio & image analysis into an llm though, why not allow the user to choose their own audio & image analysis alongside their own llm choice?

spott - a day ago

Is there a paper on this?

I'm curious how they pre-trained it... I feel like it must have had audio/image output that they chopped off.

I wonder how hard it would be to add it back on.

foota - 21 hours ago

It feels like this would be beneficial to give the model more of a deep understanding of visual knowledge.

SubiculumCode - a day ago

"Laptop ready: Small enough to run locally with just 16GB of VRAM or unified memory." I wish. I just have 12.

zuminator - a day ago

How does it compare with e4b, aside from being larger?

zkmon - a day ago

It's quite interesting to see the quants pour into the HF page. I keep refreshing it and see many new quants every few mins.

BiraIgnacio - a day ago

Using an embedder instead of an encoder is quite clever. Not sure who came up with that first but it's a cool idea.

semiinfinitely - a day ago

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away

comma_at - a day ago

Are there qwen or minimax or other open weight models of same hardware requirements that outperform this?

synergy20 - 3 hours ago

Ollama does not support this yet. What else can I try?

4k4 - 21 hours ago

I'm actually wondering how much better this is (besides multimedia) than prismml's 1.5bit model based on qwen2.5 or something.

zkmon - a day ago

I'm waiting for FP8 quant, preferably from Google.

adt - a day ago

https://lifearchitect.ai/models-table/

easygenes - 19 hours ago

I want to like the vision capabilities of the model. However, when I gave it an image which Gemma 26B A4B and Qwen 3.6 35B A3B have no problem correctly describing in detail, including identifying the Taj Mahal in the background, it utterly failed. Its sense of the image was that it was a "distorted wide panorama" and even when I asked directly if it was the Taj Mahal it said no. The reference models saw it correctly as a normal square image taken from a fairly rectilinear lens (iPhone main camera).

SuperV1234 - a day ago

How does this compare to frontier models?

claysmithr - a day ago

I don’t see the download in lm studio

dyauspitr - a day ago

Just tried this out. Jesus Christ. Google does some things so well.

alienjesus - 19 hours ago

good one, wanna try on Cerebras inference in Agentic Browsing

powera - a day ago

I'm seeing very low quality results on LMStudio with this model. Worse than Gemma 3 12B.

It is getting questions like "David has 18 apples and Ivan has 7 apples. How many apples do they have together?" wrong half the time, while Gemma3 12B could very consistently answer that. Other smoke tests (like Chinese translation, and the infamous "Rs in Strawberry" test) also show poor results.

I don't know if it is a quantization/release issue, if the parameters needed for accurate responses have changed (i.e. it needs "thinking" tokens to handle its base error rate), or if the model has been so focused on audio/video that the text processing is bad.

keyle - 12 hours ago

Not terribly impressed with this one. I asked it for recommendations between Paris and Berlin and option 3 was Rome... and option 4 was Tokyo.

mmmkay.

jdelman - a day ago

I can’t help but wonder if this is the basis of the model they’ve helped tune for Apple.

mlmonkey - a day ago

Is there some place where we can try it before downloading the gigabytes of weights?

t0lo - 14 hours ago

Asked it to name the director who wears a Rolex and likes submarines. It said Christopher Nolan.

kordlessagain - a day ago

Cool!

digdugdirk - a day ago

I do enjoy the immediate out of touch signaling with the "runs on your 16gb vram laptop" line. Because everyone has a laptop with 16gb vram, or can just pop out and buy a new one, right?