Gemini Diffusion
simonwillison.net
862 points by mdp2021 2 days ago
I have no idea how it actually works (at Google), but I wouldn't be surprised if it was just post-training, because the RWKV people recently did something similar: they replaced the whole attention mechanism with WKV (forward-only linear attention) and created that Frankenstein just by post-training.
The big wow moment there is that it sort of implies that most of the useful knowledge is in the FFN, and that attention itself is not that unique/important.
https://substack.recursal.ai/p/qwerky-72b-and-32b-training-l...
BTW: it could also be interesting to reuse already-trained attention and see how long the FFN by itself takes in the GPT-2 speedrun (it would be against the rules, but still very interesting IMHO; definitely something I'd like to read a paper about): https://github.com/KellerJordan/modded-nanogpt
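A minimal sketch of the freezing mechanics, with a toy stand-in for a block (not the actual modded-nanogpt code; a real run would load pretrained attention weights):

    import torch.nn as nn

    # Toy stand-in for one transformer block.
    block = nn.ModuleDict({
        "attn": nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True),
        "mlp": nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)),
    })
    # Keep the "already trained" attention fixed; only the FFN stays trainable.
    for p in block["attn"].parameters():
        p.requires_grad = False
    trainable = [p for p in block.parameters() if p.requires_grad]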
Also, I read yesterday that at some point the embeddings across all of the models become (very) comparable/similar, and that a simple converter can be trained. If both of these statements are true, maybe we could train everything much faster just by sharing fixed embeddings and attentions.
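As a sketch of what training such a converter amounts to (random matrices here, so the fit itself is meaningless; with real embeddings over a shared vocabulary you would check whether the residual is small):

    import numpy as np

    rng = np.random.default_rng(0)
    E_a = rng.normal(size=(1000, 512))  # model A's embeddings for a shared vocab (illustrative)
    E_b = rng.normal(size=(1000, 768))  # model B's embeddings for the same vocab
    # Least-squares linear converter W such that E_a @ W ~ E_b:
    W, *_ = np.linalg.lstsq(E_a, E_b, rcond=None)
    rel_err = np.linalg.norm(E_a @ W - E_b) / np.linalg.norm(E_b)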
Ever notice that attention is (with the highest respect to the original researchers) "just" inputting the entire past of the network into a reverse-MoE neural network? (meaning the expert is selecting parts of the input instead of parts of the neural network to execute)
In a way everyone knew this would work. Nobody did it because it's so inefficient that even R and Python users thought it would be ridiculously slow (or simply couldn't execute it enough to train to a reasonable extent).
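For the record, that "soft selection over the input" view is exactly what the standard scaled dot-product formula computes; a bare-bones sketch:

    import torch
    import torch.nn.functional as F

    T, d = 6, 16
    x = torch.randn(T, d)                        # the entire past, one row per position
    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
    scores = (x @ Wq) @ (x @ Wk).T / d ** 0.5    # how strongly each position selects each other
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
    weights = F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
    out = weights @ (x @ Wv)                     # a learned soft selection over past inputs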
Attention is just a completely arbitrary way to split the network so that learning can be parallelized.
What contributed more towards success, in my opinion, are "shortcut connections" through layers, which give early layers more influence during learning.
> What contributed more towards success, in my opinion, are "shortcut connections" through layers, which give early layers more influence during learning.
For those who don't know, that is the idea behind ResNet (He et al., Deep Residual Learning for Image Recognition, https://arxiv.org/abs/1512.03385), one of the most influential papers in deep learning of all time.
Residual connections make it possible to train networks that are arbitrarily deep. Before ResNet, networks that were too deep were essentially not trainable due to vanishing or exploding gradients.
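In code the whole trick is a single addition; a minimal sketch (real ResNet blocks use convolutions and batch norm rather than linear layers):

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, x):
            # y = x + F(x): the identity path gives gradients a direct route to early layers
            return x + self.body(x)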
> Also, I read yesterday that at some point, the embeddings across all of the models are (very) comparable/similar, and a simple converter can be trained
That was from here: https://news.ycombinator.com/item?id=44054425
The relative unimportance of the exact SDPA attention in use in modern transformers is already known: https://arxiv.org/abs/2111.11418
The FFN, normalization, and residual connections are absolutely irreplaceable -- but attention can be replaced with almost any other layer that shares information between tokens, such as pooling, convolution, random mixing, etc.
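PoolFormer from that line of work is a nice illustration: keep the transformer skeleton, swap attention for average pooling. Roughly this, modulo details of the actual paper code:

    import torch.nn as nn

    class PoolMixer(nn.Module):
        """Token mixer: share information between tokens via local average pooling."""
        def __init__(self, window=3):
            super().__init__()
            self.pool = nn.AvgPool1d(window, stride=1, padding=window // 2,
                                     count_include_pad=False)

        def forward(self, x):  # x: (batch, tokens, dim)
            pooled = self.pool(x.transpose(1, 2)).transpose(1, 2)
            return pooled - x  # subtract the input; the residual connection adds it back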
That's...ridiculously fast.
I still feel like the best uses of models we've seen to date are for brand-new code and quick prototyping. I'm less convinced of the strength of their capabilities for improving on large preexisting content over which someone has repeatedly iterated.
Part of that is because, by definition, models cannot know what is not in a codebase and there is meaningful signal in that negative space. Encoding what isn't there seems like a hard problem, so even as models get smarter, they will continue to be handicapped by that lack of institutional knowledge, so to speak.
Imagine giving a large codebase to an incredibly talented developer and asking them to zero-shot a particular problem in one go, with only moments to read it and no opportunity to ask questions. More often than not, a less talented developer who is very familiar with that codebase will be able to add more value with the same amount of effort when tackling that same problem.
The trick to this is you've got to talk to them and share this information in the same way. I can give an example. These days my main workflow is as follows: if I have some big feature/refactor/whatever I'm going to work on I'll just start talking to o3 about it essentially as if it was a coworker and (somewhat painstakingly) paste in relevant source files it needs for context. We'll have a high-level discussion about what it is we're trying to build and how it relates to the existing code until I get the sense o3 has a clear and nuanced understanding (these discussions tend to sharpen my own understanding as well). Then, I'll ask o3 to generate an implementation plan that describes what needs to happen across the codebase in order for whatever it is to be realized. I'll then take that and hand it off to Codex, which might spend 10min executing shell commands to read source, edit files, test, etc. and then I've got a PR ready, which sometimes takes a bit more manual editing, and other times is perfectly ready to merge.
What you're saying is true re: them needing rich context, too, but this isn't a fundamental limitation; it's just an aspect of what it takes to work with them effectively. There's definitely a learning curve, but once you've got it down it's not only very powerful but, for me anyway, a more enjoyable headspace to occupy than lots of lower-level manual editing.
I would suggest trying the Continue.dev VSCode plugin for selective context injection. The plugin is Apache 2.0 licensed, and you can hook it up to any LLM API including local.
It has most of the same features as GitHub Copilot, but a few extra features I find essential. It can scrape documentation sites for individual libraries, which means you can do stuff like `@pandas @terminal @codebase Help me fix this error`.
For greenfield projects I will usually start out in a web-based chat interface, but the second I need to go back and forth between IDE and the web I switch over to the Continue.dev plugin.
Interesting approach, I'm definitely going to steal your wording for "generate an implementation plan that...".
I do something similar but entirely within Cursor:
1. Create a `docs/feature_name_spec.md` and use voice-to-text to brain-dump what I am trying to do.

2. Open the AI chat panel in "Ask" mode while referencing that spec file, and ask (paste) a boilerplate snippet like: "1) Ask clarifying questions about intent, domain, restrictions, ambiguity or missing details. 2) Briefly identify any missing documents, data, or background information that would help you complete the task thoroughly."

3. Move that list of questions into the spec doc and answer them there, attach the files it asked for, and just rerun the above request (optionally switching to a different model, like gemini-2.5-pro -> o3, for a different perspective).

4. Ask it to make an execution plan. At that point I have a fully spec'd-out feature and documented business logic, and I either use Edit mode on each step or Agent mode.
That's for more complex features touching many files or refactors, but I essentially do a simplified version of that within the same chat by editing my original chat prompt until I'm confident I explained myself well
I spend so much time just finding and moving context pieces around these days that I bought a physical macro pad, and I've been thinking about designing some software specifically to make this quicker: basically rapidly finding/selecting context pieces, loading them into buffers, and relaying them to the conversation context. I think it'll have to be backed by agentic search and voice controlled, and I'm not sure how to best integrate with possible consumers… I dunno if that makes sense. I started building it and realized I need to think on the design a bit more, so I'm building more like infrastructure pieces now.
That's very close to my workflow: https://taoofmac.com/space/blog/2025/05/13/2230
I find myself using a similar workflow with Aider. I'll use chat mode to plan, adjust context, enable edits, and let it go. I'll give it a broad objective and tell it to ask me questions until the requirements are clear, then a planning summary. Flipping the script is especially helpful when I'm unsure what I actually want.
"...what is not in a codebase, and there is meaningful signal in that negative space."
Man, I've been writing software for money for decades now, but this fundamental truth never occurred to me, at least not consciously and with such clarity.
So, thank you!
I am not certain that I agree with this. If there are alternative ways of solving a problem that were not taken, then these should be documented in comments. A mantra I try to tell myself and my colleagues: if information exists in your brain and nowhere else, then write it down _somewhere_. If I tried 5 different libraries before settling on one, then I write in comments which libraries I tried but didn't work, and why. If I used a particular tool to debug a race condition, then I put a link to a wiki page on how to use it in the comments. If we have one particular colleague who is an expert in some area, then I write their name in a comment. Basically, anything that is going to save future developers' time should be written down.
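Concretely, the kind of comment I mean (every name and detail below is made up for illustration):

    # Library choice (2024-03): settled on httpx.
    # Also tried: requests (no async support), aiohttp (retry hooks got awkward),
    # urllib3 (too low-level for our use case).
    # The race condition in fetch_all() was tracked down with faulthandler;
    # see wiki/debugging-async-races for the recipe.
    # Domain expert for the retry semantics: J. Doe (payments team).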
Agreed. IMO it's always a good idea to document design choices.
The owner can write down the problem, a few solutions that were considered, why they were chosen/rejected, and a more detailed description of the final design. Stakeholders then review and provide feedback, and after some back and forth all eventually sign off the design. That not only serves to align the organization, but to document why things were done that way, so that future hires can get a sense of what is behind the code, and who was involved in case they have more questions.
This was how we did things at some $BigCorps and it paid dividends.
What are you disagreeing with?
Even if you do this (and it's good practice!), it is, empirically, not done in the vast majority of codebases.
And even if you succeed with the utmost diligence, a vastly greater number of decisions (those you were not even aware of consciously, or took for granted) will remain undocumented but still be quite real in this "negative space" sense.
My pleasure ;-) I borrowed the term from art: https://www.michaelalfano.com/tag/negative-space/?id=400
I'm an artist who works on pre-production, fast-turnaround animations for films, and yeah, that hits the nail on the head: knowing what NOT to do, which elements not to focus on, is a majority of the power that comes with experience. I'm fast because I know which corners can best be cut and how to illustrate what I need to.
Then document it. Whenever you choose one algorithm/library/tech stack over another, write down your reasoning in the documentation.
The funny thing is that I have at least a dozen comments in my current codebase where I explain in detail why certain things are not put in place or are not served via other-solution-that-might-seem-obvious.
I understand what negative space is in art. Can you explain how this applies to writing software?
A quick example is a basic 2D game. If you're not using an engine (just a graphics library) and you have some animations, experience will tell you not to write most of the code with raw numbers only. More often than not, you will write a quick vector module, just as you will use a local origin for transformations.
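A minimal sketch of what I mean by a quick vector module (pure Python, names purely illustrative):

    from dataclasses import dataclass

    @dataclass
    class Vec2:
        x: float
        y: float

        def __add__(self, o):
            return Vec2(self.x + o.x, self.y + o.y)

        def __mul__(self, s):
            return Vec2(self.x * s, self.y * s)

    # Animation frames become offsets from a local origin instead of raw numbers everywhere:
    origin = Vec2(320, 240)            # sprite's local origin on screen
    frame_offset = Vec2(10, -4)        # keyframe data, relative to the origin
    pos = origin + frame_offset * 2.0  # scale, then place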
But more often than not, the naive code is the result of not doing the above and just writing the feature. It technically does the job, but it’s verbose and difficult to maintain.
So just like in drawing, you need to think holistically about the program. Every line of code should support an abstraction. And that will dictate which code to write and which to not write.
That’s why you often see the concept of patterns in software. The code is not important. The patterns are. The whole structure, more so. Code is just what shapes these.
I have written 2D games, but maybe the metaphor is just lost on me, or I simply disagree with its usefulness here.
Negative space in art achieves a certain effect. Like in the linked sibling comment, the empty space is part of the sculpture.
So the empty space has purpose and meaning.
But if I didn't choose a certain library... the empty place of that library serves no function. It does change my code and might make my dev life easier or harder, but it has no meaning in itself for the result.
Let me take a crack at it.
I think the negative space metaphor in software can be in the shape of the abstractions and hitting the sweet spot of making the right things easy/railroaded while not over engineering it.
In visual art, negative space is part of the layout and the visual journey. It helps define the relationships between things as much as those things themselves and, used judiciously, is one of the differences between elegance and clutter.
I think "not choosing a library" is important info but isn't the same thing as negative space and is instead more like restrictions, framing, or limitation. You can do a lot with what isn't shown but in this area I think good art and good software diverge in goals - to me good art makes me think or feel or speculate while good software instead makes me understand with as little of those other things as possible.
The caveat here might be not choosing things for very good but not obvious reasons, which should be loudly documented. Things like licensing, other external influences, or specific hardware requirements, maybe. For example, I once banned the creation of a GraphQL API in a product that could have benefited from it, because we still needed to support the existing API for third parties forever, so the suggestion to replace the API was actually, secretly, the suggestion to maintain two APIs in lockstep.
Yes, the code is not actually important, as two different teams will solve the same problem in different ways, just like a great painting and a bad one can use the same base materials. What's important is the purpose and the constraints of any solution. Any decision you take propagates down the timeline and outward in the project. And decisions preclude other decisions from being taken.
So whatever you do will leave a mark. But there are some spaces that should not be filled in: while it may look nice in the moment or taken in isolation, when looking at the whole, it makes a mess.
I'm talking more about architecting code instead of naively writing it. The same point can be made about libraries, but the considerations are more subjective.
Most naive approaches to writing software look like assembly. But instead of opcodes, you have library functions. We moved away from assembly and assembly-like programming because it's essentially one-shot: any modification to the program is difficult and/or tedious. So instead of having one blob of instructions, we introduce gaps so that the program becomes more flexible. We have functions, objects, modules… but the actual links between them still need to be shaped.
A library can have some influence on the shape, but it is minor if you favor the solution over the means. But sometimes you see people really going hard to fill the gaps of the shape, and that’s when you start to shout KISS and YAGNI. Sometimes they want to alter the shape and you bring out SOLID and other principles…
"I’m talking more about architecting code instead of naively writing them."
Yeah, we are talking about code design.
And I got my head filled with all the design patterns back in university, but my first bigger real-world projects were somehow horribly overengineered and still inflexible. And I don't think it was just lack of experience.
Nowadays I prefer a very, very simple and clear approach.
No dark empty space I want to design around.
No clever hidden layers, that prevent the introduction of a pragmatic new API.
I guess I get what you probably mean, and it ain't that, but to me it has too much of the vibe of the time when I was amazed at myself for coming up with a seemingly super clever (complex) design that sounded great in theory.
Yes, simplicity is always important, but it does not equate to easiness. The axis from simple to complex is independent of the axis from easy to hard. It may be easy to apply patterns blindly to your codebase and make it complex, just as it is easy to write naive and simple code that then becomes difficult to work with.
The mark of a good programmer is to balance all of these so that it’s easy to work with the codebase on an ongoing basis. And more often than not it’s similar to the sketching process. At each stage, you get enough feedback to judge the right direction for the next iteration. You do not start with all the details, nor with careless doodling. But one aspect that is often overlooked with artists is how often they practice to get that judgement capability.
"At each stage, you get enough feedback to judge the right direction for the next iteration."
Depends on the project, I would say. What do you do if all of a sudden the requirements change again? Or the platform evolves/degrades? Then you compromise, and I can compromise better with a simple solution. And I would never claim simple equals easy; rather the opposite. Like you said, it is easy to make complex things. Also, I never applied design patterns for the sake of it (even though it might have sounded like it); KISS was part of the theories as well. But I did value emphasized cleverness too much, as I thought that this is the way it is supposed to be done.
My summary: simple, direct solutions are to be preferred, and trying to be clever is not very clever.
I'd rather have 3 lines of code than one compressed clever one that no one can understand on a first read. And the same goes for the bigger design picture.