Kotlin creator's new language: talk to LLMs in specs, not English
codespeak.dev
315 points by souvlakee 4 days ago
> It’s one of those things that crackpots keep trying to do, no matter how much you tell them it could never work. If the spec defines precisely what a program will do, with enough detail that it can be used to generate the program itself, this just begs the question: how do you write the spec? Such a complete spec is just as hard to write as the underlying computer program, because just as many details have to be answered by spec writer as the programmer.
Joel Spolsky, stackoverflow.com founder, Talk at Yale: Part 1 of 3 https://www.joelonsoftware.com/2007/12/03/talk-at-yale-part-...
Program generation from a spec meant something vastly different in 2007 than it does now. People can, and do, generate programs from underspecified prompts. Being systematic about how prompts work is a worthwhile area to explore.
I don't see how it's different. You could always describe what you want to a team lead or consultant and pay them to build it.
That's still the best way to turn a spec into a program and comes with all the downsides it entails.
Sure, but Joel isn't saying that's impossible or that people who do that are crackpots. In fact, he was an advocate of writing specs ahead of time [1] - for people.
At the time "generating a program from a spec" was an idea floating around that you could come up with a "spec language" that was easier than regular programming languages but somehow still had the same power and could be compiled directly into a program. That's the crackpot idea that Joel is referencing - but that's not what a spec language used with an LLM is doing.
[1]: https://www.joelonsoftware.com/2000/10/02/painless-functiona...
This is an excellent observation and puts into words something I have barely scratched the surface of. Along with specifications, formal verification is another domain that received the "just automate it" treatment in the before times.
And because formal verification with LLMs is an active area of open research, I have some hope that the old idea of automated formal verification is starting to take shape. There is a lot to talk about here, but I'll leave a link to the 1968 NATO Software Engineering Conference [1] for those who are interested in where these thoughts originated. It goes deeply into the subject of "specification languages" and other related concepts. My understanding is that the historical split between computing science and software engineering has its roots in this 1968 conference.
[1]: http://homepages.cs.ncl.ac.uk/brian.randell/NATO/nato1968.PD...
I've built an AI compiler that has my take on this: https://github.com/jfilby/intentcode
Why don't you simply point your agent to your Jira tickets? It's easier than paying another third party for their magic LLM incantation loop.
Isn't that how it always goes? First it's just the crackpots. Then it's a fringe. Soon it's the way things have always been done.
Might look like it, might also just be survivorship bias. A lot of crackpot ideas hit the wall instead of being a success. We only notice the successes and might think of them as the default, not the exception.
I was commenting from that perspective: basically anything we consider today to be “the way it’s done” was once something only crazy people did. I think it was pg who said something like: if you’re only working on safe ideas you’ll never have a breakthrough, because if breakthroughs came from safe ideas there would be more of them. I’m not saying every crazy idea changes the world, but if you want to change the world you need a crazy idea.
It actually makes sense that code is becoming amorphous and we will no longer scale in terms of building out new features (which has become cheap), but by defining stricter and stricter behavior constraints and structural invariants.
Yeah, "what you're able to build" is no longer one of the most important things, "what you won't build" just became a lot more important.
I’m actually starting to see at work how people are writing skills in a very procedural manner. Something like:
First, collect the following information from user: …. Second, send http request to the following endpoint with the certain payload…. If server returned error - report back to user.
It makes me crack up every time I see that kind of stuff. Why on Earth wouldn't you just write a script for that purpose? 10x faster, zero tokens burned, 100% deterministic.
> Why on Earth wouldn't you just write a script for that purpose?
- Because your bash-fu may not be good enough
- Because parts of the process may not be amenable to scripting, especially if they require LLMs
- Because the inputs to some steps are fuzzy enough that only an LLM can handle them
- etc...
That being said, yes, anything amenable to being turned into scripts should be.
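For the deterministic parts, a skill like the one described above really can be a short script. A minimal sketch in Python (the endpoint, payload fields, and messages are made up for illustration):

```python
import json
import urllib.request
import urllib.error

# Hypothetical endpoint standing in for "the following endpoint" in the skill.
ENDPOINT = "https://api.example.com/v1/orders"

def run_skill(user_input: dict) -> str:
    """Deterministic version of the procedural skill: validate the
    collected information, send the HTTP request, report errors back."""
    # Step 1: collect/validate the required information from the user.
    missing = [k for k in ("name", "quantity") if k not in user_input]
    if missing:
        return f"Missing fields: {', '.join(missing)}"

    # Step 2: send the request with the payload.
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(user_input).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return f"OK: server returned {resp.status}"
    except urllib.error.HTTPError as e:
        # Step 3: if the server returned an error, report back to the user.
        return f"Request failed: {e.code} {e.reason}"
```

Zero tokens, and the same input always takes the same path.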
import Mathlib
def Goldbach := ∀ x : ℕ, Even x → x > 2 → ∃ (y z: ℕ), Nat.Prime y ∧ Nat.Prime z ∧ x = y + z
A short specification of the Goldbach conjecture in Lean. Much harder to implement (i.e. prove), though. Implementation details are always hidden by the interface, which makes specifying easier than producing. Under the Curry-Howard correspondence, Joel's position amounts to saying that any question is as hard to ask as it is to answer, and any statement as hard to formulate as it is to prove, which is really just saying that all describable statements are true.

This argument is based on the notion of proof irrelevance: if a theorem is true, any proof is as good as any other. That is not the case for computer programs; two programs that implement the same specification may be very different in terms of performance, size, UI/UX, code maintainability, etc.
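To make the last point concrete, here are two hypothetical programs (example mine) that satisfy the same specification, "return the n-th Fibonacci number", yet differ wildly in performance:

```python
def fib_naive(n: int) -> int:
    # Meets the spec, but exponential time: recomputes subproblems.
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_iter(n: int) -> int:
    # Same input/output behaviour, linear time: a different program
    # implementing the same spec.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

Proof irrelevance would say these are interchangeable; anyone who has waited on `fib_naive(45)` knows otherwise.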
Performance and size can easily be added to any specification, maintainability is not a problem if you never have to maintain it, UI/UX are design issues not code issues. If you specify a UI, it will have the UX you want. We can already do UI creation with visual editors.
Non-obvious solution Joel wasn't thinking of pre-AI age: you obviously just invent a mind crystal to interpret the spec during the build process ;)
What he misses is that it's much easier to change the spec than the code. And if the cost of regenerating the code is low enough, then the code is not worth talking about.
Is it? If the spec is as detailed as the code would be? If you make a change to one part of the spec do you now have inconsistencies that the LLM is going to have to resolve in some way? Are we going to have a compiler, or type checker type tools for the spec to catch these errors sooner?
It IS a compiler. You might as well ask if the machine-language output of a C compiler is as detailed as the C code was.
To anticipate your objection: you can get over determinism now, or you can get over it later. You will get over it, though, if you intend to stay in this business.
> It IS a compiler.
What are you talking about? If an LLM is a compiler, then I'm a compiler. Are we going to redefine the meaning of words in order not to upset the LLM makers?
Originally, the word "computer" referred to a human being. See https://en.wikipedia.org/wiki/Computer_(occupation)
Over time, when digital computers became commonplace, the computing moved from the person to the machine. At this time, arguably the humans doing the programming of the machine were doing the work we now ask of a "compiler".
So yes, an LLM can be a compiler in some sense (from a high level abstract language into a programming language), and you too can be a compiler! But currently it's probably a good use of the LLM's time and probably not a good use of yours.
I don't know, doing a lot of completely pointless, time-wasting staring at hex dumps and assembly language in my youth was a pretty darned good lesson. I say it's a worthwhile hobby to be a compiler.
But your point stands. There is a period beyond which doing more than learning the fundamentals just becomes toil.
> The horse is here to stay, but the automobile is only a novelty — a fad.
Advice given to Henry Ford’s lawyer, Horace Rackham, by an unnamed president of the Michigan Savings Bank in 1903.
People literally specifying software into existence in 2026 gives this quote a vibe of "aerodynamically speaking, bumblebees cannot fly".
You can only specify software into existence if your idea of what you want it to look like is as vague as your specification. Sometimes this is the case, sometimes not.
As far as I can tell it's not a new language, but rather an alternative workflow for LLM-based development along with a tool that implements it.
The idea, IIUC, seems to be that instead of directly telling an LLM agent how to change the code, you keep markdown "spec" files describing what the code does and then the "codespeak" tool runs a diff on the spec files and tells the agent to make those changes; then you check the code and commit both updated specs and code.
It has the advantage that the prompts are all saved along with the source rather than lost, and in a format that lets you also look at the whole current specification.
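If I understand the workflow right, a minimal version of the diff-to-prompt step could look something like this (a guess at the mechanics, since the tool is closed-source; the function name and prompt wording are made up):

```python
import difflib

def build_agent_prompt(old_spec: str, new_spec: str) -> str:
    """Diff two versions of a markdown spec and turn the change into a
    prompt for a coding agent. A sketch of the idea, not CodeSpeak's
    actual implementation."""
    diff = "\n".join(
        difflib.unified_diff(
            old_spec.splitlines(),
            new_spec.splitlines(),
            fromfile="spec.md (old)",
            tofile="spec.md (new)",
            lineterm="",
        )
    )
    return (
        "The project specification changed as follows. "
        "Update the code so it matches the new spec:\n\n" + diff
    )
```

The specs stay the source of truth under version control, and the prompt is derived from them rather than typed ad hoc into a chat box.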
The limitation seems to be that you can't modify the code yourself if you want the spec to keep reflecting it (and also can't make LLM-driven changes that refer to the actual code). In general it's also not guaranteed that the spec captures everything important about the program, so the code still potentially contains "source" information (for example, maybe you want the background of a GUI to be white, and it is, because the LLM happened to choose that, but it's not written in the spec).
The latter can maybe be mitigated by doing multiple generations and checking them all, but that multiplies LLM and verification costs.
Also it seems that the tool severely limits the configurability of the agentic generation process, although that's just a limitation of the specific tool.
> The limitation seems to be that you can't modify the code yourself if you want the spec to reflect it
Eventually, we'll end up in a world where humans don't need to touch code, but we are not there yet. We are looking into ways to "catch up" the specs with whatever changes happen to the code outside of CodeSpeak (agents, manual edits, whatever). It's an interesting exercise. In the case of agents, it's very helpful to look at the prompts users gave them (we are experimenting with inspecting the sessions from ~/.claude).
More generally, `codespeak takeover` [1] is a tool to convert code into specs, and we are teaching it to take prompts from agent sessions into account. Seems very helpful, actually.
I think it's a valid use case to start something in vibe coding mode and then switch to CodeSpeak if you want long-term maintainability. From "sprint mode" to "marathon mode", so to speak.
> Eventually, we'll end up in a world where humans don't need to touch code, but we are not there yet.
Will we though? Wouldn't AI need to reach a stage where it is a tool, like a compiler, which is 100% deterministic?
Two things to mention here:
1. You are right that we can redefine what is code. If code is the central artefact that humans are dealing with to tell machines and other humans how the system works, then CodeSpeak specs will become code, and CodeSpeak will be a compiler. This is why I often refer to CodeSpeak as a next-level programming language.
2. I don't think being deterministic per se is what matters. Being predictable certainly does. Human engineers are not deterministic yet people pay them a lot of money and use their work all the time.
>Human engineers are not deterministic yet people pay them
Human carpenters are not deterministic yet they won't use a machine saw that goes off line even 1% of the time. The whole history of tools, including software, is one of trying to make the thing do more precisely what is intended, whether the intent is right or not.
Can you imagine some machine tool maker making something faulty and then saying, "Well hey, humans aren't deterministic."
We will, and soon, because it does not have to be deterministic like a compiler. It only has to pass all the tests.
Who is writing the tests?
There are different kinds of tests:
* regression tests – can be generated
* conformance tests – often can be generated
* acceptance tests – are another form of specification and should come from humans.
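A hypothetical sketch of that split (the function and requirements are made up): the regression test pins recorded "golden" behaviour and could be generated mechanically, while the acceptance test encodes a human requirement the tooling can't infer.

```python
# Hypothetical function under test.
def format_price(cents: int) -> str:
    return f"${cents // 100}.{cents % 100:02d}"

# Regression test: generable by recording current behaviour
# ("golden" outputs) and pinning it against future changes.
GOLDEN = {0: "$0.00", 199: "$1.99", 100050: "$1000.50"}

def test_regression():
    for cents, expected in GOLDEN.items():
        assert format_price(cents) == expected

# Acceptance test: encodes human intent, e.g. "prices always show
# exactly two decimal places". No tool can invent this requirement.
def test_acceptance():
    assert format_price(5).endswith(".05")
```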
Human intent can be expressed as
* documents (specs, etc)
* review comments, etc
* tests with clear yes/no feedback (data for automated tests, or just manual testing)
And this is basically all that matters, see more here: https://www.linkedin.com/posts/abreslav_so-what-would-you-sa...
A compiler is not 100% deterministic either. Its output can change when you upgrade its version, and it can change when you change optimization options. Using profile-guided optimization can also change the output between runs.
If you change inputs then obviously you will get a different output. Crucially using the same inputs, however, produces the same output. So compilers are actually deterministic.
This is irrelevant over the long run because the environment changes even if nothing else does. A compiler from the 1980's still produces identical output given the original source code if you can run it. Some form of virtualization might be in order, but the environment is still changing while the deterministic subset shrinks.
Having faith that determinism will last forever is foolish. You have to upgrade at some point, and you will run into problems. New bugs, incompatibilities, workflow changes, whatever the case will make the determinism property moot.
Many compilers aren't deterministic. That's why the effort to make Linux distros have reproducible builds took so long and so much effort.
The reason is, it's often more work to be deterministic than not deterministic, so compilers don't do it. For example, they may compile functions in parallel and append them to the output in the order they complete.
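A toy sketch of that failure mode (the "compiler" and object-code strings are stand-ins): appending results in completion order is nondeterministic, while sorting back into source order restores reproducibility at a small cost.

```python
import concurrent.futures

def compile_fn(name: str) -> str:
    # Stand-in for compiling one function to object code.
    return f"<object code for {name}>"

def build(functions: list[str], deterministic: bool) -> list[str]:
    """Compile functions in parallel. Appending in completion order
    (as_completed) can differ from run to run; sorting by source
    order is what reproducible-build efforts effectively do."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(compile_fn, f): i for i, f in enumerate(functions)}
        results = [
            (futures[fut], fut.result())
            for fut in concurrent.futures.as_completed(futures)
        ]
    if deterministic:
        results.sort()  # restore source order
    return [obj for _, obj in results]
```

The nondeterministic build still contains exactly the same pieces, just possibly in a different order, which is why byte-identical reproducibility needs the extra work.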
Why are we eliminating our own job and maybe hobby so eagerly? Whatever. It is done.
As far as I can tell C is not a new language, but rather an alternative workflow for assembly development along with a tool that implements it.
Also they seem to want to run this as a business, which seems absurd to me since I don't see how they can possibly charge money, and anyway the idea is so simple that it can be reimplemented in less than a week (less than a day for a basic version) and those alternative implementations may turn out to be better.
It also seems to be closed-source, which means that unless they open the source very soon it will very likely be immediately replaced in popularity by an open source version if it turns out to gain traction.
I think these limitations could be addressed by allowing trivial manual adjustments to the generated code before committing. And/or allowing for trivial code changes without a spec change. The judgement of "trivial" being that it still follows the spec and does not add functionality mandating a spec change. I haven't checked if they support any of this but I would be frustrated not being allowed to make such a small code change, say to fix an off-by-one error that I recently got from LLM output. The code change would be smaller than the spec change.
Cool idea overall, an incremental pseudocode compiler. Interesting to see how well it scales.
I can also see a hybrid solution with non-specced code files for things where the size of code and spec would be the same, like for enums or mapping tables.
Also a bit formal. Maybe something like this will be the output of the prompt to let me know what the AI is going to generate in the binary, but I doubt I will be writing code like this in 5 years, English will probably be fine at my level.
> Also it seems that the tool severely limits the configurability of the agentic generation process, although that's just a limitation of the specific tool.
Working on that as well. We need to be a lot more flexible and configurable
This doesn't make too much sense to me.
* This isn't a language, it's some tooling to map specs to code and re-generate
* Models aren't deterministic - every time you would try to re-apply you'd likely get different output (without feeding the current code into the re-apply and let it just recommend changes)
* Models are evolving rapidly, this months flavour of Codex/Sonnet/etc would very likely generate different code from last months
* Text specifications are always under-specified, lossy and tend to gloss over a huge amount of details that the code has to make concrete - this is fine in a small example, but in a larger code base?
* Every non-trivial codebase would be made up of hundreds of specs that interact and influence each other - very hard (and context-heavy) to read all the specs that impact a piece of functionality and keep it coherent
I do think there are opportunities in this space, but what I'd like to see is:
* write text specifications
* model transforms text into a *formal* specification
* then the formal spec is translated into code which can be verified against the spec
Steps 2 and 3 could be merged into one if there were practical/popular languages that also support verification, in the vein of Ada/SPARK.
But you can also get there by generating tests from the formal specification that validate the implementation.
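A cheap approximation of that last step, sketched here (the names and the toy sorting example are illustrative, not from the comment): treat the formal spec as an executable predicate and generate random tests of the implementation against it.

```python
import random
from collections import Counter

# The "formal specification" as an executable predicate relating
# input to output: ys is ordered and is a permutation of xs.
def spec_sorted(xs: list[int], ys: list[int]) -> bool:
    ordered = all(ys[i] <= ys[i + 1] for i in range(len(ys) - 1))
    permutation = Counter(xs) == Counter(ys)
    return ordered and permutation

# Candidate implementation (could have been model-generated).
def my_sort(xs: list[int]) -> list[int]:
    out = list(xs)
    for i in range(len(out)):
        for j in range(i + 1, len(out)):
            if out[j] < out[i]:
                out[i], out[j] = out[j], out[i]
    return out

def validate(impl, spec, trials: int = 200) -> bool:
    """Generate random inputs and check the implementation against
    the spec. Not a proof, but a usable stand-in for verification."""
    rng = random.Random(0)  # seeded, so the check is reproducible
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if not spec(xs, impl(xs)):
            return False
    return True
```

Property-based testing tools do this far more thoroughly, but even this much gives the spec teeth: a wrong implementation fails regardless of which code the model happened to emit.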
> Models aren't deterministic - every time you would try to re-apply you'd likely get different output (without feeding the current code into the re-apply and let it just recommend changes)
If the result is always provably correct, it doesn't matter whether or not it's different at the code level. People interested in systems like this believe that the outcome of what the code does is infinitely more important than the code itself.
That if at the beginning of your sentence is doing a whole lot of work. Indeed, if we could formally and provably (another extremely loaded word) generate good code that'd be one thing, but proving correctness is one of those basically impossible tasks.
> but proving correctness is one of those basically impossible tasks.
To aim for a meeting of the minds... Would you help me out and unpack what you mean so there is less ambiguity? This might be minor terminological confusion. It is possible we have different takes, though -- that's what I'm trying to figure out.
There are at least two senses of 'correctness' that people sometimes mean: (a) correctness relative to a formal spec: this is expensive but doable*; (b) confidence that a spec matches human intent: IMO, usually a messy decision involving governance, organizational priorities, and resource constraints.
Sometimes people refer to software correctness problems in a very general sense, but I find it hard to parse those. I'm familiar with particular theoretical results such as Rice's theorem and the halting problem that pertain to arbitrary programs.
* With tools like {Lean, Dafny, Verus, Coq} and in projects like {CompCert, seL4}.
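To illustrate sense (a) in the same style as the Lean snippet upthread, here is a toy spec with a machine-checked proof that an implementation meets it (example mine):

```lean
import Mathlib

-- A toy spec: `double` must agree with multiplication by 2.
def double (n : ℕ) : ℕ := n + n

-- Correctness in sense (a): a machine-checked proof that the
-- implementation satisfies the spec. Expensive but doable.
theorem double_correct (n : ℕ) : double n = 2 * n := by
  unfold double
  omega
```

Sense (b), whether `double` is what anyone actually wanted, is exactly what no checker can settle for you.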
Let's rephrase:
Since nobody involved actually cares whether the code works or not, it doesn't matter whether it's a different wrong thing each time.
You got it completely backwards. The claim is that if the code does exactly what the spec says (which generated tests are supposed to "prove") then the actual code does not matter, even if it's different each time.
The point they are making is the tests are neither necessary nor sufficient alone to prove the code does exactly what the spec says. Looking at the tests isn't enough to prove anything; as an extreme example, if no one involved looks at the code, then the tests can just be static always passing and you wouldn't know either way whether or not the code matches the spec or not.
If anyone cared enough they could look at the code and see the problem immediately and with little effort, but we're encouraging a world where no one cares enough to put even that baseline effort because *gestures at* the tests are passing. Who cares how wrong the code is and in what ways if all the lights are green?
> If the result is always provably correct it doesn't matter whether or not it's different at the code level. People interested in systems like this believe that the outcome of what the code does is infinity more important than the code itself.
If the spec is so complete that it covers everything, you might as well write the code.
The benefit of writing a spec and having the LLM code it, is that the LLM will fill in a lot of blanks. And it is this filling in of blanks that is non-deterministic.
> If the spec is so complete that it covers everything, you might as well write the code.
Welcome to the usual offshoring experience.
Besides, you can deterministically generate bad code, and not deterministically generate good code.
The code is what the code does.
The shoe is what the shoe does.
Except one shoe is made by children in a fire-trap sweatshop with no breaks, and the other was made by a well paid adult in good working conditions.
The ends don’t justify the means. The process of making impacts the output in ways that are subtle and important, but even holding the output as a fixed thing - the process of making still matters, at least to the people making it.
The end is whether the code meets the functional and non functional requirements.
And guess how much shoe companies make who manufacture shoes in sweatshop conditions versus the ones who make artisanal handcrafted shoes?
Functional requirements are known knowns.
Out of bounds behavior is sometimes a known unknown, but in the era of generated code is exclusively unknown unknowns.
Good luck speccing out all the unanticipated side effects and undefined behaviors. Perhaps you can prompt the agent in a loop a number of times, but it's hard to believe that the brute-force throw-more-tokens-at-it approach has the same level of return as a more attentive audit by human eyeballs.
Are you as a developer 100% able to trust that you didn’t miss anything? Your team if you are a team lead who delegates tasks to other developers? If you outsource non business things like Salesforce integrations etc do you know all of the code they wrote? Your library dependencies? Your infrastructure providers?
It seems like ^ and ^^ agree to me. Am I missing something?
I don’t know. I’m making a point that the only people whose sole responsibility is code that they personally write are mid level ticket takers.
I don’t review every line of code by everyone whose output I’m responsible for, I ask them to explain how they did things and care about their testing, the functional and non functional requirements and hotspots like concurrency, data access patterns, architectural issues etc.
For instance, I haven’t done web development since 2002 except for a little copy and paste work. I completely vibe coded three internal web admin sites for separate projects and used Amazon Cognito for authentication. I didn’t look at a line of code that AI generated any more than I would have looked at a line of code for a website I delegated to the web developer. I cared about functionality and UX.