Gemini 3 Flash: Frontier intelligence built for speed
blog.google
1041 points by meetpateltech a day ago
Docs: https://ai.google.dev/gemini-api/docs/gemini-3
Developer Blog: https://blog.google/technology/developers/build-with-gemini-...
Model Card [pdf]: https://deepmind.google/models/model-cards/gemini-3-flash/
Gemini 3 Flash in Search AI mode: https://blog.google/products/search/google-ai-mode-update-ge...
Deepmind Page: https://deepmind.google/models/gemini/flash/
samyok (a day ago):
Don't let the "flash" name fool you, this is an amazing model. I have been playing with it for the past few weeks, and it's genuinely my new favorite; it's so fast and has such vast world knowledge that it's more performant than Claude Opus 4.5 or GPT 5.2 extra high, for a fraction (basically an order of magnitude less!!) of the inference time and price.

thecupisblue (21 hours ago):
Oh wow - I recently tried 3 Pro preview and it was too slow for me. After reading your comment I ran my product benchmark against 2.5 Flash, 2.5 Pro and 3.0 Flash. The results are better AND the response times have stayed the same.

What an insane gain - especially considering the price compared to 2.5 Pro.

I'm about to get much better results for 1/3rd of the price. Not sure what magic Google did here, but I would love to hear a more technical deep dive comparing what they do differently in the Pro and Flash models to achieve such performance. Also wondering, how did you get early access? I'm using the Gemini API quite a lot and have a quite nice internal benchmark suite for it, so I would love to toy with the new ones as they come out.

lancekey (14 hours ago):
Curious to learn what a "product benchmark" looks like. Is it evals you use to test prompts/models? A third-party tool? Examples from the wild are a great learning tool, so anything you're able to share is appreciated.

theshrike79 (7 hours ago):
Everyone should have their own "pelican riding a bicycle" benchmark they test new models on. And it shouldn't be shared publicly, so that the models won't learn about it accidentally :)

ggsp (6 hours ago):
Any suggestions for a simple tool to set up your own local evals?

dimava (an hour ago):
Just ask an LLM to write one on top of OpenRouter, the AI SDK and Bun, to take your .md input files and save the outputs as .md files (or whatever you need). Take https://github.com/T3-Content/auto-draftify as an example.
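The runner it spits out is only a page of code - roughly this sketch (assuming the @openrouter/ai-sdk-provider package; the model id and folder names are placeholders, point it at whatever you're actually testing):

    // run-evals.ts - run every prompts/*.md file against a model and save the output
    // sketch only: assumes `bun add ai @openrouter/ai-sdk-provider` and OPENROUTER_API_KEY set
    import { mkdir, readdir } from "node:fs/promises";
    import { generateText } from "ai";
    import { createOpenRouter } from "@openrouter/ai-sdk-provider";

    const openrouter = createOpenRouter({ apiKey: process.env.OPENROUTER_API_KEY! });
    const model = openrouter("google/gemini-3-flash-preview"); // placeholder model id

    await mkdir("outputs", { recursive: true });
    for (const file of await readdir("prompts")) {
      if (!file.endsWith(".md")) continue;
      const prompt = await Bun.file(`prompts/${file}`).text();
      const { text } = await generateText({ model, prompt });
      await Bun.write(`outputs/${file}`, text); // one output .md per input .md
      console.log(`done: ${file}`);
    }

Run it with `bun run run-evals.ts`, then diff the outputs/ folder between models.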
theshrike79 (4 hours ago):
My "tool" is just prompts saved in a text file that I feed to new models by hand. I haven't built a bespoke framework on top of it. ...yet. Crap, do I need to now? =)

kedihacker (36 minutes ago):
Well, you need to stop them from getting incorporated into its training data.

ggsp (2 hours ago):
Yeah, I've wondered about the same myself… My evals are also a pile of text snippets, as are some of my workflows. I thought I'd have a look to see what's out there and found Promptfoo and Inspect AI. I haven't tried either but will for my next round of evals.

m00dy (9 hours ago):
May I ask about your internal benchmark? I'm building a new set of benchmarks and a testing suite for agentic workflows using deepwalker [0]. How do you design your benchmark suite? It would be really cool if you could give more details.

lambda (17 hours ago):
I'm a significant genAI skeptic. I periodically ask them questions about topics that are subtle or tricky, and somewhat niche, that I know a lot about, and find that they frequently provide extremely bad answers. There have been improvements on some topics, but there's one benchmark question that I have that just about every model I've tried has completely gotten wrong.

Tried it on LMArena recently and got a comparison between Gemini 2.5 Flash and a codenamed model that people believe was a preview of Gemini 3 Flash. Gemini 2.5 Flash got it completely wrong. Gemini 3 Flash actually gave a reasonable answer; not quite up to the best human description, but it's the first model I've found that actually seems to mostly correctly answer the question.

So, it's just one data point, but at least for my one fairly niche benchmark problem, Gemini 3 Flash has successfully answered a question that none of the others I've tried have (I haven't actually tried Gemini 3 Pro, but I'd compared various Claude and ChatGPT models, and a few different open-weights models).

So, I guess I need to put together some more benchmark problems to get a better sample than one, but it's at least now passing an "I can find the answer to this in the top 3 hits of a Google search for a niche topic" test better than any of the other models. There's still a lot I'm skeptical about in all the LLM hype, but at least they are making some progress in being able to accurately answer a wider range of questions.

prettyblocks (16 hours ago):
I don't think tricky niche knowledge is the sweet spot for genAI, and it likely won't be for some time. Instead, it's a great replacement for rote tasks where less-than-perfect performance is good enough: transcription, OCR, boilerplate code generation, etc.

lambda (15 hours ago):
The thing is, I see people use it for tricky niche knowledge all the time, using it as an alternative to doing a Google search. So I want to have a general idea of how good it is at this. I found something that was niche, but not super niche; I could easily find a good, human-written answer in the top couple of results of a Google search. But until now, all LLM answers I've gotten for it have been complete hallucinated gibberish.

Anyhow, this is a single data point, and I need to expand my set of benchmark questions a bit now, but this is the first time that I've actually seen progress on this particular personal benchmark.

ozim (8 hours ago):
That's riding the hype machine and throwing the baby out with the bathwater. Get an API and try to use it for classification of text or images. If you have an Excel file with 10k somewhat random-looking entries that you want to classify, or filter down to the 10 that are important to you, use an LLM. Get it to do audio transcription: you can now just talk and it will take notes for you at a level that was not possible earlier without training on someone's voice; it can handle anyone's voice. Fixing up text is of course also big.

Data classification is easy for an LLM. Data transformation is a bit harder but still great. Creating new data is hard, so for things like answering questions where it has to generate stuff from thin air, it will hallucinate like a madman. The tasks LLMs are good at are used in the background by people building actually useful software on top of LLMs, but those problems are not seen by the general public, who only sees a chat box.
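A rough sketch of that kind of classification pass, reusing the AI SDK + OpenRouter stack from the eval-runner sketch above (the labels, file name and model id are all made up; generateObject with a zod schema is one way to get structured output back):

    // classify-rows.ts - label free-text rows exported from a spreadsheet (sketch, made-up labels)
    import { generateObject } from "ai";
    import { createOpenRouter } from "@openrouter/ai-sdk-provider";
    import { z } from "zod";

    const openrouter = createOpenRouter({ apiKey: process.env.OPENROUTER_API_KEY! });

    const schema = z.object({
      label: z.enum(["bug report", "feature request", "billing", "other"]), // hypothetical label set
      important: z.boolean(), // should a human look at this row?
    });

    const rows = (await Bun.file("entries.csv").text()).split("\n").filter(Boolean);

    for (const row of rows) {
      const { object } = await generateObject({
        model: openrouter("google/gemini-3-flash-preview"), // placeholder model id
        schema,
        prompt: `Classify this entry and flag it if it needs human attention:\n${row}`,
      });
      if (object.important) console.log(object.label, "=>", row);
    }

Point it at a CSV export and you get a labeled, filterable list instead of 10k free-text rows.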
illiac786 (8 hours ago):
But people using the wrong tool for a task is nothing new - using Excel as a database (still happening today), etc. Maybe the scale is different with genAI and there are some painful learnings ahead of us.

katzenversteher (9 hours ago):
I also use niche questions a lot, but mostly to check how much the models tend to hallucinate. E.g. I start asking about rank badges in Star Trek, which they usually get right, and then I ask about specific (non-existent) rank badges shaped like strawberries or something like that. Or I ask about smaller German cities and what's famous about them. I know that without the ability to search it's very unlikely the model actually has accurate "memories" about these things; I just hope one day they will actually know that their "memory" is bad or non-existent and will tell me so instead of hallucinating something.

Europas (2 hours ago):
I'm waiting for properly adjusted, specific LLMs: an LLM trained on so much trustworthy generic data that it is able to understand/comprehend me and different languages, but always talks to a fact database in the background. I don't need an LLM to have a trillion parameters if I just need it to be a great user interface. Someone is probably working on this somewhere, or will, but let's see.

mikepurvis (15 hours ago):
And Google themselves obviously believe that too, as they happily insert AI summaries at the top of most SERPs now.

ComputerGuru (15 hours ago):
Or maybe Google knows most people search inane, obvious things?

coldtea (14 hours ago):
Or, more likely, Google couldn't give a rat's arse whether those AI summaries are good or not (except to the degree that people don't flee), and what it cares about is that they keep users with Google itself, instead of clicking off to other sources. After all, it's the same search engine team that didn't care about its search results - its main draw - actively going to shit for over a decade.

vitorgrs (14 hours ago):
Google AI Overviews get obvious things wrong a lot of the time, so... lol. They probably use an old Flash Lite model, something super small, and just summarize the search...

mikepurvis (9 hours ago):
Those summaries would be far more expensive to generate than the searches themselves, so they're probably caching the top 100k most common queries or something, maybe even pre-caching them.

ozim (16 hours ago):
Second this. Basically, making sense of unstructured data is super cool. I can get 20 people to write an answer the way they feel like it, and the model can convert it to structured data - something I would otherwise have to spend time on, or I would have to make a form with mandatory fields that annoy the audience. I am already building useful tools with the help of models. Asking tricky or trivia questions is fun and games; there are much more interesting ways to use AI.

DeathArrow (5 hours ago):
Well, I used Grok to find information I forgot about, like product names, films, books and various articles on different subjects. Google search didn't help, but putting the LLM to work did the trick. So I think LLMs can be good for finding niche info.

DrewADesign (12 hours ago):
Yeah, but tests like that deliberately prod the boundaries of its capability rather than how well it does what it's good at.
jve (an hour ago):
Counterpoint about general knowledge that is documented/discussed in different spots on the internet: today I had to resolve performance problems for a SQL Server statement. I've been doing it for years, know the regular pitfalls, and sometimes have to find the "right" words to explain to a customer why X is bad and such. I described the issue to GPT 5.2, gave it the query and the execution plan, and asked for help. It was spot on: high-quality responses, actionable items, and explanations of why this or that is bad, how to improve it, and why SQL Server in particular may have generated such a query plan. I could instantly validate the response given my experience in the field. I even answered with some parts of ChatGPT's output, given how well it explained things - however, I did mention that to the customer and I did tell them I approve of the answer. Ask a high-quality question and you receive a high-quality answer. And I am happy that I found out about a SQL Server flag where I can influence that particular decision. But the suggestion was not limited to that; there were multiple points given that would help.

andai (15 hours ago):
So this is an interesting benchmark, because if the answer is actually in the top 3 Google results, then my Python script that runs a Google search, scrapes the top n results and shoves them into a crappy LLM would pass your benchmark too! Which also implies that (for most tasks) most of the weights in an LLM are unnecessary, since they are spent on memorizing the long tail of Common Crawl... but maybe memorizing infinite trivia is not a bug but actually required for the generalization to work? (Humans don't have far transfer though... do transformers have it?)

lambda (15 hours ago):
I've tried doing this query with search enabled in LLMs before, which is supposed to effectively do that, and even with that they didn't give very good answers. It's a very physical kind of thing, and it's easy to conflate with other similar descriptions, so they would frequently just conflate various different things and give some horrible mash-up answer that wasn't about the specific thing I'd asked about.

andai (5 hours ago):
So it's a difficult question for LLMs to answer even when given perfect context? Kinda sounds like you're testing two things at the same time then, right? The knowledge of the thing (was it in the training data and was it memorized?) and the understanding of the thing (can they explain it properly even if you give them the answer in context).

TeodorDyakov (16 hours ago):
Hi. I am curious: what was the benchmark question? Cheers!

Turskarama (16 hours ago):
The problem with publicly disclosing these is that if lots of people adopt them, they will become targets for inclusion in the model and will no longer be a good benchmark.

lambda (16 hours ago):
Yeah, that's part of why I don't disclose. Obviously, the fact that I've done Google searches and tested the models on these means that their systems may have picked up on them; I'm sure that Google uses its huge dataset of Google searches and its search index as inputs to its training, so Google has an advantage here. But, well, that might be why Google's new models are so much better: they're actually taking advantage of some of this massive dataset they've had for years.

grog454 (14 hours ago):
This thought process is pretty baffling to me, and this is at least the second time I've encountered it on HN. What's the value of a secret benchmark to anyone but the secret holder? Does your niche benchmark even influence which model you use for unrelated queries? If LLM authors care enough about your niche (they don't) and fake the response somehow, you will learn on the very next query that something is amiss. Now that query is your secret benchmark. Even for niche topics it's rare that I need to provide more than one correction or knowledge update.

nl (13 hours ago):
I have a bunch of private benchmarks I run against new models I'm evaluating. The reason I don't disclose isn't generally that I think an individual person is going to read my post and update the model to include it. Instead, it is because if I write "I ask the question X and expect Y", then that data ends up in the training corpus of new LLMs. However, one set of my benchmarks is a more generalized type of test (think a parlor-game type thing) that actually works quite well. That set is the kind of thing that could be learnt via reinforcement learning very well, and just mentioning it could be enough for a training company or data provider to try it. You can generate thousands of verifiable tests - potentially with verifiable reasoning traces - quite easily.

grog454 (13 hours ago):
Ok, but then your "post" isn't scientific by definition, since it cannot be verified. "Post" is in quotes because I don't know what you're trying to do, but you're implying some sort of public discourse. For fun: https://chatgpt.com/s/t_694361c12cec819185e9850d0cf0c629

eru (12 hours ago):
I didn't see anyone claiming any "science"? Did I miss something?

grog454 (12 hours ago):
I guess there are two things I'm still stuck on:

1. What is the purpose of the benchmark?

2. What is the purpose of publicly discussing a benchmark's results but keeping the methodology secret?

To me it's in the same spirit as claiming to have defeated AlphaZero but refusing to share the game.