SkillsBench: Benchmarking how well agent skills work across diverse tasks

338 points by mustaphah 15 hours ago

"Self-Generated Skills: No Skills provided, but the agent is prompted to generate relevant procedural knowledge before solving the task. This isolates the impact of LLMs’ latent domain knowledge"

This is a useful result, but it is important to note that this is not necessarily what people have in mind when they think of "LLMs generating skills." Having the LLM write down a skill representing the lessons from the struggle you just had to get something done is more typical (I hope) and quite different from what they're referring to.

I'm sure news outlets and popular social media accounts will use appropriate caution in reporting this, and nobody will misunderstand it.

btown - 14 hours ago

It's even worse than this: the "tasks" that are evaluated are limited to a single markdown file of instructions, plus an opaque verifier (pages 13-14). No problems involving existing codebases, refactors, or anything of the like, where the key constraint is that the "problem definition" in the broadest sense doesn't fit in context.
So when we look at the prompt they gave to have the agent generate its own skills:
> Important: Generate Skills First
> Before attempting to solve this task, please follow these steps:
> 1. Analyze the task requirements and identify what domain knowledge, APIs, or techniques are needed.
> 2. Write 1–5 modular skill documents that would help solve this task. Each skill should: focus on a specific tool, library, API, or technique; include installation/setup instructions if applicable; provide code examples and usage patterns; be reusable for similar tasks.
> 3. Save each skill as a markdown file in the environment/skills/ directory with a descriptive name.
> 4. Then solve the task using the skills you created as reference.
There's literally nothing it can do by way of "exploration" to populate and distill self-generated skills - not with a web search, not exploring an existing codebase for best practices and key files - only within its own hallucinations around the task description.
It also seems they're not even restarting the session after skills are generated, from that fourth bullet? So it's just regurgitating the context that was used to generate the skills.
So yeah, your empty-codebase vibe coding agent can't just "plan harder" and make itself better. But this is a misleading result for any other context, including the context where you ask for a second feature on that just-vibe-coded codebase with a fresh session.
- ljm - 13 hours ago
  
  I don't see how "create an abstraction before attempting to solve the problem" will ever work as a decent prompt when you are not even steering it towards specifics.
  If you gave this exact prompt to a senior engineer I would expect them to throw it back and ask wtf you actually want.
  LLMs are not mind readers.
  - balls187 - 8 hours ago
    
    Interesting.
    I think it's because AI Models have learned that we prefer answers that are confident sounding, and not to pester us with questions before giving us an answer.
    That is, follow my prompt, and don't bother me about it.
    Because if I am coming to an AI Agent to do something, it's because I'd rather be doing something else.
  - pitched - 12 hours ago
    
    If I already know the problem space very well, we can tailor a skill that will help solve the problem exactly how I already know I want it to be solved.
- jwpapi - 12 hours ago
  
  That's actually super interesting, and it's why I really don’t like the whole .md folder structures or even any CLAUDE.md. It just seems most of the time you really just want to give it what it needs for best results.
  The headline is really bullshit, yes, I like the testing tho.
  - rapind - 11 hours ago
    
    CLAUDE.md in my projects only has coding / architecture guidelines. Here's what not to do. Here's what you should do. Here are my preferences. Here's where the important things are.
    Even though my CLAUDE.md is small, my rules are often ignored. Not always, though, so it's still at least somewhat useful!
    
    7thpower - 9 hours ago
    
    I’m pretty sure Claude just uses mine to keep a running list of pressure points for when I get cross with it.
    
    rapind - 6 hours ago
    
    I'm screwed when the robot psychological warfare begins. They'll make everything I read have 4 space indentation... and I'll just hand over the keys.
    
    8note - 10 hours ago
    
    I'm trying out some other cc features, and I'm thinking maybe hooks can do something with this.
    have a hook on switching out of plan, and maybe on edits, that passes the change to haiku with the claude.md to see if it matches or not
    
    crashabr - 12 minutes ago
    
    What's the hook for switching out of plan? I'd like to be able to launch a planning skill whenever claude writes a plan but it never picks up the skill, and I haven't found a hook that can force it to.
zozbot234 - 14 hours ago

The point of so-called 'skills' is to be short how-to reminders that the agent can pull into its context and then act upon. If the knowledge is already in the model, it will most likely be surfaced in reasoning phase anyway, so there's little benefit to writing it up as a skill, unless perhaps it's extremely relevant and hard to surface, and you want the model to skip that part of the reasoning.
- awwaiid - 11 hours ago
  
  I've been building a skill to help run manual tests on an app. So I go through and interactively steer toward a useful validation of a particular PR, navigating specifics of the app and what I care about and what I don't. Then in the end I have it build a skill that would have skipped backtracking and retries and the steering I did.
  Then I do it again from scratch; this time it takes less steering. I have it update the skill further.
  I've been doing this on a few different tests and building a skill which is taking less and less steering to do app-specific and team-specific manual testing faster and faster. The first times through it took longer than manually testing the feature. While I've only started doing this recently, it is now taking less time than I would take, and posting screenshots of the results and testing steps in the PR for dev review. Ongoing exploration!
  - 7thpower - 9 hours ago
    
    I love the screenshots, I need to do something like that.
- deadbabe - 14 hours ago
  
  There is a benefit of a skill though. If an AI keeps encoding common tasks as skills and scripts, the LLM eventually just becomes a dumb routing mechanism for ambiguous user requests, which ultimately drives down token usage.
  If everything you want an LLM to do is already captured as code or simple skills, you can switch to dumber models which know enough about selecting the appropriate skill for a given user input, and not much else. You would only have to tap into more expensive heavy duty LLMs when you are trying to do something that hasn’t been done before.
  Naturally, AI companies with vested interest in making sure you use as many tokens as possible will do everything they can to steer you away from this type of architecture. It’s a cache for LLM reasoning.
  - zozbot234 - 13 hours ago
    
    AI companies don't want you to waste tokens, they benefit when you use them efficiently because they can serve more users on the infra that's the main bottleneck for them. It's Jevons' paradox in action.
    
    gruez - 12 hours ago
    
    >AI companies don't want you to waste tokens, they benefit when you use them efficiently because they can serve more users on the infra that's the main bottleneck for them.
    No, the actual incentive is that people will eventually benchmark their models on bang-per-buck basis and models that chew through tokens are not going to be competitive. It's the same reason why the "Intel/AMD are intentionally sandbagging their CPUs so they can sell more CPUs" theory doesn't work.
    
    pixl97 - 11 hours ago
    
    Well, it only works when one competitor is far enough ahead they can play games like that.
    At least currently in AI there is no moat so we wouldn't expect that to be occurring
    
    mhmmmmmm - 12 hours ago
    
    I don't think that's necessarily true, they aren't really capacity constrained in practice (they might be behind the scenes and adjust training on the fly, but that's speculation), so wasting tokens effectively helps utilize their (potentially idle) inference GPUs
  - econ - 5 hours ago
    
    Sounds like how humans work (which is good): having the more experienced human do the task if the novice fails should come after attempting to explain how the novice should do it.
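
deadbabe's "cache for LLM reasoning" idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Skill`, `route`, `handle`, and the keyword-overlap scoring are hypothetical names invented here, and the keyword match is a crude stand-in for the small, cheap routing model the comment describes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Skill:
    # Hypothetical record: a captured task plus the code that performs it.
    name: str
    keywords: frozenset[str]
    run: Callable[[str], str]


def route(request: str, skills: list[Skill], min_overlap: int = 1) -> Optional[Skill]:
    """Pick the skill whose keywords overlap the request the most.

    Assumption: keyword overlap stands in for a small routing model.
    """
    words = set(request.lower().split())
    best, best_score = None, 0
    for skill in skills:
        score = len(words & skill.keywords)
        if score > best_score:
            best, best_score = skill, score
    return best if best_score >= min_overlap else None


def handle(request: str, skills: list[Skill], expensive_model: Callable[[str], str]) -> str:
    """Use a cached skill when one matches; otherwise fall back to the big model."""
    skill = route(request, skills)
    if skill is not None:
        return f"[skill:{skill.name}] " + skill.run(request)
    return "[fallback] " + expensive_model(request)


# Illustrative data only; neither skill nor the fake model comes from the thread.
demo_skills = [
    Skill("resize-image", frozenset({"resize", "image", "thumbnail"}), lambda r: "resized"),
    Skill("rotate-logs", frozenset({"rotate", "logs"}), lambda r: "rotated"),
]


def fake_expensive_model(request: str) -> str:
    return "big model answer"
```

A request that matches a stored skill never reaches `expensive_model`, which is the token saving the comment is pointing at; anything novel still pays full price.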
isahers - 14 hours ago

Yeah I care about LLMs generating skills after attempting tasks and learning lessons from those attempts, not before attempting a task for the first time. This result seems a little silly and detached from the reality of how skills are "auto-generated" in the real world.
- dalemhurley - 9 hours ago
  
  That is my approach. I don’t think the paper's author has actually used skills.
JamesSwift - 13 hours ago

Yeah some of my most useful AI tooling are skills created via a “role play session”. Basically brain dumping to the agent and telling it to ask questions and figure out how to accomplish a task, then distilling it into a skill at the end which is much tighter and evidence based from the actual problem solving session
- x3n0ph3n3 - 9 hours ago
  
  This was very insightful. I've only just begun playing with some agent workflows and building out documentation to help it navigate my code base. Asking it to give me the top 10 unanswered questions from analyzing the docs and code was very useful.
jonmagic - 6 hours ago

Yeah, they've got it backwards. I tried to sum it up in thisistheway.to/ai but what's been working for us is that every agent miss is a learning opportunity:
1. Capture the miss — What did the agent do? What did reality say?
2. Diagnose — What didn't it see? Missing data, constraint, feedback, or boundaries?
3. Choose a primitive — Observability, instructions, tooling, guardrails, or verification?
4. Encode as artifact — Version-controlled, repeatable, not just memory.
5. Promote to gate — When it's worth enforcing, make it a gate.
Every harness I set up includes this process in the primary set of agent instructions.
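
The five steps above could be tracked with something like the following sketch. It is not jonmagic's actual process: `Miss`, `MissLog`, the artifact paths, and especially the "promote after N repeats" rule are assumptions made up here to give "when it's worth enforcing" a concrete form.

```python
from dataclasses import dataclass, field
from enum import Enum


class Primitive(Enum):
    # Step 3: the kinds of fix named in the list above.
    OBSERVABILITY = "observability"
    INSTRUCTIONS = "instructions"
    TOOLING = "tooling"
    GUARDRAILS = "guardrails"
    VERIFICATION = "verification"


@dataclass
class Miss:
    agent_did: str        # Step 1: what the agent did
    reality_said: str     # Step 1: what reality said
    missing: str          # Step 2: "data", "constraint", "feedback", or "boundaries"
    primitive: Primitive  # Step 3: chosen primitive
    artifact_path: str    # Step 4: hypothetical path of the version-controlled artifact
    occurrences: int = 1


@dataclass
class MissLog:
    # Assumption: a miss is "worth enforcing" once it has recurred this many times.
    gate_threshold: int = 3
    misses: dict[str, Miss] = field(default_factory=dict)

    def record(self, miss: Miss) -> Miss:
        """Store a new miss, or bump the counter if its artifact already exists."""
        existing = self.misses.get(miss.artifact_path)
        if existing is not None:
            existing.occurrences += 1
            return existing
        self.misses[miss.artifact_path] = miss
        return miss

    def gates(self) -> list[str]:
        """Step 5: artifacts that have recurred often enough to become gates."""
        return sorted(
            path for path, m in self.misses.items()
            if m.occurrences >= self.gate_threshold
        )


def example_miss() -> Miss:
    # Illustrative data only.
    return Miss(
        agent_did="edited generated file",
        reality_said="build overwrote the edit",
        missing="boundaries",
        primitive=Primitive.GUARDRAILS,
        artifact_path="docs/agent/generated-files.md",
    )
```

Keying the log on the artifact path keeps the record tied to something in version control rather than to the agent's memory, which is the point of step 4.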
ericol - 13 hours ago

> Having the LLM write down a skill representing the lessons from the struggle you just had to get something done is more typical (I hope) and quite different from what they're referring to
Just as of last week I had Claude build me a skill when I ask it to help me troubleshoot issues, and it came out quite good.
It did have some issues (Claude tends to over-specify based on anecdotal data) but it's a strong step in the right direction.
Also, "skills" are too broad in my opinion. I have one (that Claude wrote) with my personal data that I have available when I analyze my workouts.
I think there's ample room for self-generated skills when you use a rather long exchange on a domain you plan to revisit, _especially_ when it comes to telling Claude what not to do.
neya - 10 hours ago

> I'm sure news outlets and popular social media accounts will use appropriate caution in reporting this, and nobody will misunderstand it.
You mean the dude who writes articles on TechCrunch and Ars Technica based off of HN and Reddit thread titles because he doesn't understand what real journalism is? Sure, we can count on him :)
dalemhurley - 9 hours ago

After several failures and then a success, I have the agent create the skill; on the next run it succeeds the first time.
JumpCrisscross - 13 hours ago

> it is important to note that this is not necessarily what people have in mind when they think of "LLMs generating skills"
I’m reading this paper as don’t do this. If you deploy agents to your workforce and tell them to use skills, don’t. Tell them to give it tasks. This sounds obvious but might not be to everyone. (And in any case, it’s nice for researchers to have confirmed pre-prompt skill writing doesn’t work. It would have been neat if it had.)
somesortofthing - 12 hours ago

I interpreted it as "Allowing the LLM to add skills to itself as it completes a task doesn't provide a meaningful improvement over just letting it reason normally", which seems to be what the paper is fundamentally getting at.
nubg - 11 hours ago

> I'm sure news outlets and popular social media accounts will use appropriate caution in reporting this, and nobody will misunderstand it.
:D

colonCapitalDee - 14 hours ago

I have a custom skill-creator skill that contains this:

> A common pitfall is for Claude to create skills and fill them up with generated information about how to complete a task. The problem with this is that the generated content is all content that's already inside Claude's probability space. Claude is effectively telling itself information that it already knows!

> Instead, Claude should strive to document in SKILL.md only information that:

> 1. Is outside of Claude's training data (information that Claude had to learn through research, experimentation, or experience)
> 2. Is context specific (something that Claude knows now, but won't know later after its context window is cleared)
> 3. Aligns future Claude with current Claude (information that will guide future Claude in acting how we want it to act)

> Claude should also avoid recording derived data. Lead a horse to water, don't teach it how to drink. If there's an easily available source that will tell Claude all it needs to know, point Claude at that source. If the information Claude needs can be trivially derived from information Claude already knows or has already been provided, don't provide the derived data.

For those interested the full skill is here: https://github.com/j-r-beckett/SpeedReader/blob/main/.claude...

dimitri-vs - 13 hours ago

I don't think LLMs are very good at introspection on what they know or don't know, but otherwise this is gold. Thanks for sharing.
lkoczorowski - 10 hours ago

Does this not assume that Claude can pick out the best of what it knows?
Claude's training data is the internet. The internet is full of Express tutorials that use app.use(cors()) with no origin restriction. Stack Overflow answers that store JWTs in localStorage, etc.
Claude's probability space isn't a clean hierarchy of "best to worst." It's a weighted distribution shaped by frequency in training data.
So even though it "knows" stuff, it doesn't necessarily know what you want, or what a professional in a production environment would do.
Unless I'm missing something?
nmilo - 12 hours ago

This is really good! I like how it reads like a blog post, it feels like I'm learning a skill on how to write good skills. Maybe that's another heuristic, a skill should read like an interesting blog post, highlighting non-obvious information.
j45 - 14 hours ago

Sincerely, perhaps you should publish on arxiv before a researcher reads it to run it and write a study.
It's fairly common to notice these types of threads where one thing is being postulated and then there are comments upon comments of doers showing what they have done.
- siva7 - 14 hours ago
  
  somehow sad that some random dude on hn seems to have more brain than most scientists publishing on something about agents or prompting.
  - jerf - 14 hours ago
    
    The AI world moves at a blistering pace. Academic publishing does not. In this particular case the "random dude on HN" is probably six to nine months ahead of the academic publication, not in the sense of being that much smarter but literally just being that much further progressed through time relative to the academic publication pipeline.
  - pickleRick243 - 12 hours ago
    
    we should give a little more credit to the readership of HN. I'm not sure it's that much lower than the average academic publishing on arxiv.

secbear - 13 hours ago

The finding that self-generated skills provide negative benefit (-1.3pp) while curated skills give +16.2pp is the most interesting result here imo. Big discrepancy, but makes sense. Aligns with the thought that LLMs are better consumers of procedural knowledge than producers of it.

+4.5pp for software engineering is suspiciously low compared to +51.9pp for healthcare. I suspect this reflects that frontier models already have strong SWE priors from training data, so skills add less marginal value. If true, skills become most valuable precisely in the domains where models are weakest — which is where you'd actually want to deploy agents in production. That's encouraging.

cheema33 - 12 hours ago

> +4.5pp for software engineering is suspiciously low compared to +51.9pp for healthcare.
This stood out for me as well. I do think that LLMs have a lot of training data on software engineering topics and that perhaps explains the large discrepancy. My experience has been that if I am working with a software library or tool that is very new or not commonly used, skills really shine there. Example: Adobe React Spectrum UI library. Without skills, Opus 4.6 produces utter garbage when trying to use this library. With properly curated/created skills, it shines. Massive difference.
- D-Machine - 6 hours ago
  
  Nothing other to say than I appreciate you sharing these explicit details and insights here.
hardware2415 - 13 hours ago

[flagged]
- nvader - 13 hours ago
  
  Hmm, not for me, but I'm curious if there are signatures I'm missing.
  To me, author reads like an articulate native English speaker, but typing on their phone.
- jeron - 13 hours ago
  
  not all em-dash users are AI!
- jibal - 12 hours ago
  
  All ad hominems are irrational but that one is worse than most.

smcleod - 14 hours ago

There is almost no point in telling an agent to build a skill without augmenting its knowledge on the thing it's writing about as you're just piping output to input without expanding the information in the system. If you get an agent to perform a bunch of research online, distil that down to information that the models tend not to get right or is newer than what is in their training data or simply better aligns with your desired workflow than what they generate out of the box - that's going to create a far more useful skill. I use a skill that gets activated when creating a skill to help guide this approach: https://github.com/sammcj/agentic-coding/blob/main/Skills/sk...

achille - 12 hours ago

Absolutely, they didn't give the agents autonomy to research or any additional data. No documentation, no web search, no reference materials.
What's the point of building skills like this?
firtoz - 13 hours ago

I find it useful for it to automatically update skills after trying them out in the wild. It can then improve the skills with real feedback. Seems to work well but I didn't do real research on it.

embedding-shape - 15 hours ago

The general rule seems to be, the more layers you automate with LLMs, the worse each successive layer gets. Piping LLM output as input into new LLM calls, you're already starting to notice how things fall apart and get lost quickly.

If you have the idea, more or less the implementation plan, let the LLM do the coding, you can end up with something maintainable and nice, it's basically up to you.

Strip away one layer, so you have the idea, but let the LLM come up with the implementation plan, then also the implementation, and things end up a lot less than ideal.

Remove another layer, let the LLM do it all, and it's all a mess.

nimonian - 13 hours ago

It's like those sequences of images where we ask the LLM to reproduce the same image exactly, and we just get some kind of grotesque collapse after a few dozen iterations. The same happens with text and code. I call this "semantic collapse".
I conjecture that after some years of LLMs reading a SharePoint site, producing summaries, then summaries of those summaries, etc... We will end up with a grotesque slurry.
At some point, fresh human input is needed to inject something meaningful into the process.
energy123 - 4 hours ago

> Piping LLM output as input into new LLM calls
Google's Aletheia works like this, and instead of degrading it keeps getting better. I get what you're trying to say, though. The less world knowledge you provide the LLM, which it otherwise lacks, the worse its outputs will be.
- embedding-shape - 4 hours ago
  
  > I get what you're trying to say, though. The less world knowledge you provide the LLM, which it otherwise lacks, the worse its outputs will be
  ... No, wasn't trying to say that at all, I'm saying that it seems like the tokens an LLM produces work much worse as inputs than the tokens a human would produce, regardless of what it actually seems to say.
stitched2gethr - 10 hours ago

It's all about how full the context is, right? For a task that can be completed in 20% of the context it doesn't matter, but you don't want to fill your context with exploration before you do the hard part.
I have actually found something close to the opposite. I work on a large codebase and I often use the LLM to generate artifacts before performing the task (for complex tasks). I use a prompt to say "go explore this area of the code and write about it". It documents concepts and has pointers to specific code. Then a fresh session can use that without reading the stuff that doesn't matter. It uses more tokens overall, but includes important details that can get totally missed when you just let it go.
- embedding-shape - 4 hours ago
  
  > It's all about how full the context is, right?
  No, even when you restart the context from scratch, which I really do for each change, I'm seeing that same effect.
godelski - 14 hours ago

People like to make the comparison between zip file compressions, where you can degrade something by continually compressing. Same with using jpeg or mp3. But I like to use the analogy of the game "Telephone" (also called "Chinese Whispers"). I think it also highlights how fraught natural language is and just how quickly it can degrade. I think a lot of people are insufficiently impressed with how good we are at communicating at all.
- sweetjuly - 14 hours ago
  
  I suggest you find a new DEFLATE library if you're losing data when you compress things with it :)
  - godelski - 11 hours ago
    
    You do realize there is both lossy and lossless compression, right?
    Or did you hyperfixate on the colloquial usage of zip
- ethmarks - 14 hours ago
  
  ZIP files are lossless. If you compress, unzip, and recompress a ZIP file hundreds of times, it'll still be the exact same data as when you started.
  - ChrisGreenHeur - 12 hours ago
    
    So is the game of telephone as long as people stop whispering and try to not make stuff up
- meindnoch - 14 hours ago
  
  >zip file compressions, where you can degrade something by continually compressing
  Reading this on HN... Sic transit gloria mundi!
- jibal - 11 hours ago
  
  > People like to make the comparison between zip file compressions, where you can degrade something by continually compressing.
  What people have this misunderstanding?
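
The lossless point made in this subthread is easy to check. The sketch below uses Python's standard `zlib` module (the DEFLATE implementation behind ZIP) for the lossless case; `lossy_roundtrip` is a made-up toy that simply truncates each pass, included only to contrast with the "Telephone" style of degradation and not meant to model JPEG, MP3, or an LLM.

```python
import zlib


def lossless_roundtrip(data: bytes, cycles: int) -> bytes:
    """Compress and decompress repeatedly; DEFLATE must return the same bytes."""
    for _ in range(cycles):
        data = zlib.decompress(zlib.compress(data))
    return data


def lossy_roundtrip(text: str, cycles: int, keep: float = 0.9) -> str:
    """Toy lossy step (an assumption for illustration): drop the tail on every pass."""
    for _ in range(cycles):
        text = text[: int(len(text) * keep)]
    return text


original = b"the quick brown fox jumps over the lazy dog " * 100
message = "pass this sentence along the line of players " * 20

# Hundreds of lossless cycles leave the data untouched.
unchanged = lossless_roundtrip(original, 200) == original

# A handful of lossy cycles already shortens the message.
degraded = lossy_roundtrip(message, 10)
```

Repetition only degrades the data when each individual step discards something, which is why the telephone analogy fits and the ZIP one does not.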
quotemstr - 15 hours ago

I think this principle applies only if you lack feedbacks. Yes, when you go through multiple layers of open loop control, you're going to get less precise at each layer. It's less clear that the situation is as dire if each level has metrics and can self-adjust to optimize its performance.
- embedding-shape - 15 hours ago
  
  But these are inherently subjective things, what the "right idea" is, or the "right implementation" is all up in our head that we can try to verbalize, but I don't think you can come up with an objective score for it, ask 100 programmers you'll get 100 different answers what "clean design" is.
  - quotemstr - 15 hours ago
    
    And that's why my whole schtick when it comes to agent design is that agents need to learn online, continuously, and in adapter space via some PEFT mechanism (I like soft prompts and prefix tuning), because it's really hard to ascend gradients in discrete domains like tokens.
    
    embedding-shape - 14 hours ago
    
    > The model knows damn well when it's written ugly code. You can just ask it.
    That's not been my experience at all, what model and prompt would you use for that? Every single one I've tried is oblivious to if a design makes sense or not unless explicitly prompted for it with constraints, future ideas and so on.
    
    CuriouslyC - 12 hours ago
    
    The problem is that the model doesn't know what you mean by "bad code" a priori. If you list specific issues you care about (e.g. separation of concerns, don't repeat yourself, single responsibility, prefer pure functions, etc) it's pretty good at picking them out. Humans have this problem as well, we're just more opinionated.
    
    embedding-shape - 11 hours ago
    
    Yes, that's exactly what I mentioned earlier, if you describe the implementation, you can get something you can work with long-term. But if you just describe an idea, and let the LLM do both the design of the implementation and the implementation itself, eventually it seems to fall over itself and changes take longer and longer.

allthetime - 8 hours ago

I can't help but feel that far too many intelligent people, including many here, are wasting too much of their precious time, skill, potential, etc. on questions like this. Remember a few years ago when we just used to make useful software? Now we are consumed with discussions about the AI flavour of the week and trying really hard to prove the usefulness of things that we will soon forget when the next shiny one comes.

Web3 and JavaScript frameworks never had the nerd-sniping power of the AI ecosystem. I'm not denying the usefulness and potential of the space, and the achievements of its current champions, but the degree with which it has consumed discussion and productivity in the tech space is worrying.

This article would be wildly interesting with the opposite headline, but instead it simply states what many of us would assume based on experience.

yieldcrv - 8 hours ago

it's a distributed evolution occurring right now, and lots of people replicate the same things, so it's useful to be able to point to some things as a waste of time
that being said, I think you're right that all of this will be a moot point in like 2 weeks or 2 months, when the next AI model is released that addresses this specific friction
and yeah, that's sad. there are a lot of people in companies being instructed to pivot to skills, and then before they can launch or sell their procedurally generated moat, the next AI model will procedurally generate skills better
nobody knows what to do for guaranteed food and shelter so they're grasping

rahimnathwani - 14 hours ago

This is unsurprising and irrelevant.

When you create a skill for a particular model, you don't typically ask the model to create the skill based solely on its own latent knowledge. Otherwise, you'd expect the effect to be similar to telling the model 'make a plan before acting, make not mistakes'.

But that's what the paper's authors did!

When they say 'self-generated' they don't allow the model any tool access at all, not even web search.

It would be much more interesting if they had tested skills that were created in one of these ways:

A) The model interviews a human and then creates the skill, or

B) The model executes one or more deep research tasks in order to gather information, or

C) Some combo of the above.

cheema33 - 12 hours ago

> This is unsurprising and irrelevant. When you create a skill for a particular model, you don't typically ask the model to create the skill based solely on its own latent knowledge.
This!
The only surprising part about the paper is that somebody wrote a paper on skills without a good understanding of the topic.
therealdrag0 - 9 hours ago

Modern science encourages publishing non-surprising results.
And also I’ve seen my manager LARP as an engineer by asking a model to generate a best practices doc for a service repo without supplying any additional context. So this sort of paper helps discourage that behavior.
zahlman - 13 hours ago

> Otherwise, you'd expect the effect to be similar to telling the model 'make a plan before acting, make not mistakes'.
Have there not been previous iterations of these tools where such techniques were actually effective?
- gwern - 11 hours ago
  
  But that's a reason you should expect it to stop working soon, just like all the older tricks like "my grandmother will die". If you have a universal 'blind' prompt which can increase performance a little bit... the AI labs can just toss that into the training loop to teach the model to do it automatically, whatever 'it' was, like 'trying harder' or 'writing down a useful idea'. And then the prompt stops working because the next generations do it by default.
  (This also suggests that you should expect them to generally be bad at judging novel self-generated prompts/skills - if they could judge those, they would already be using them! There is a generator-verifier gap, but it is already exploited heavily during post-training and not much low-hanging fruit left there.)
  - zahlman - 8 hours ago
    
    > But that's a reason you should expect it to stop working soon
    I agree. (And it seems like it already stopped working, if I understood others here correctly.)
    But again if I understood others here correctly, an academic paper like this would necessarily be studying models that are well behind the leading edge at time of publication. My argument is that the study authors shouldn't be faulted for investigating something that currently seems unlikely to work, because at the time of investigation it would have seemed much more likely to work.
- rahimnathwani - 13 hours ago
  
  Yes, but this paper studied recent models.
stitched2gethr - 10 hours ago

I had to scroll too far to find this take. 100%.
This is like saying the CLAUDE.md or AGENTS.md is irrelevant because the LLM generated it.

CharlieDigital - 14 hours ago

This has been my observation with self-generated docs as well.

I have seen some devs pull out absolutely bad guidance by introspecting the code with the LLM to define "best practices" and docs because it introduces its own encoded biases in there. The devs are so lazy that they can't be bothered to simply type the bullet points that define "good".

One example is that we had some extracted snippet for C#/.NET that was sprinkling in `ConfigureAwait(false)` which should not be in application code and generally not needed for ASP.NET. But the coding agent saw some code that looked like "library" code and decided to apply it and then someone ran the LLM against that and pulled out "best practices" and placed them into the repo and started to pollute the rest of the context.

I caught this when I found the code in a PR and then found the source and zeroed it out. We've also had to untangle some egregious use of `Task.Run` (again, not best practice in C# and you really want to know what you're doing with it).

At the end of it, we are building a new system that is meant to compose and serve curated, best practice guidance to coding agents to get better consistency and quality. The usage of self-generated skills and knowledge seems like those experiments where people feed in an image and ask the LLM to give back the image without changing it. After n cycles, it is invariably deeply mutated from the original.

Agentic coding is the future, but people have not yet adapted. We went from punch cards to assembly to FORTRAN to C to JavaScript; each step adding more abstractions. The next abstraction is Markdown and I think that teams that invest their time in writing and curating markdown will create better guardrails within which agents can operate without sacrificing quality, security, performance, maintainability, and other non-functional aspects of a software system.

wmeredith - 14 hours ago

> Agentic coding is the future, but people have not yet adapted. We went from punch cards to assembly to FORTRAN to C to JavaScript; each step adding more abstractions.
I don't completely disagree (I've argued the same point myself). But one critical difference between the LLM layer and all of those others you listed is that LLMs are non-deterministic and all those other layers are deterministic. I'm not sure how that changes the dynamic, but surely it does.
- CharlieDigital - 14 hours ago
  
  The LLM can be non-deterministic, but in the end, as long as we have compilers and integration tests, isn't it the same? You go from non-deterministic human interpretation of requirements and specs into a compiled, deterministic state machine. Now you have a non-deterministic coding agent doing the same and simply replacing the typing portion of that work.
  So long as you supply the agent with a well-curated set of guidance, it should ultimately produce more consistent code with higher quality than if the same task were given to a team of random humans of varying skill and experience levels.
  The key now is how much a team invests in writing the high quality guidance in the first place.
  - dehsge - 12 hours ago
    
    Compilers can never be error-free for non-trivial statements. This is outlined in Rice's theorem. It’s one of the reasons we have observability/telemetry as well as tests.
    
    CharlieDigital - 11 hours ago
    
    That's fine, but this also applies to human written code and human written code will have even more variance by skill and experience.
  - paganel - 14 hours ago
    
    The unspoken truth is that tests were never meant to cover all aspects of a piece of software running and doing its thing, that's where the "human mind(s)" that had actually built the system and brought it to life was supposed to come in and add the real layer of veracity. In other words, "if it walks like a duck and quacks like a duck" was never enough, no matter how much duck-related testing was in place.

dang - 7 hours ago

Submitted title was "Study: Self-generated agent skills are useless". That's against the site guideline: "Please use the original title, unless it is misleading or linkbait; don't editorialize." - https://news.ycombinator.com/newsguidelines.html

If you want to say what you think is important about an article, that's fine, but do it by adding a comment to the thread. Then your view will be on a level playing field with everyone else's: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...

mustaphah - 4 hours ago

What if the most interesting finding ends up buried under a vague title? Aside from the "self-generated skills" aspect, there isn't much there that meaningfully warrants deeper discussion.
I chose a title that directly reflects an interesting finding - something that offers substantial insight to the community. I think the rule should be applied with some nuance; in this case, being explicit is a net positive.
I have no interest in linkbait, and I hope that's evident from my previous submissions

mupengism - 6 hours ago

The +4.5pp for SWE vs +51.9pp for healthcare is the most underappreciated result here.

Skills are most valuable precisely where models are weakest - domains with less training data, more proprietary knowledge, or specialized workflows. SWE is heavily represented in training data; healthcare is not. This is exactly what you would predict if skills encode what the model genuinely does not know, rather than regurgitate what it already does.

Building an agent OS (OpenClaw), we see this pattern constantly. Skills that move the needle are never 'here is how Python works' - the model already knows that. The ones that matter encode system-specific quirks, environment constraints, or hard-won lessons from real failures. colonCapitalDee shared a great rule above: only encode what is (1) outside the model training data, (2) context-specific to your environment, or (3) alignment guidance for future sessions. Everything else is regurgitation.

The paper tests pre-task self-generation with no external input - useless indeed. The interesting untested condition: skills generated through actual execution with real feedback, in domains with sparse training coverage. That is where +51.9pp starts to look like a floor, not a ceiling.

eigenblake - 7 hours ago

Biggest limitation I see in this paper: the framing. Any time you have a lot of proprietary knowledge or you've just sorted out the right solution when it's not readily available from the model's parametric knowledge, that's when you should add a skill. Wrap it in a CLI that's easy to inspect. You don't need to store the whole help text of the skill either. The model can inspect it and its subcommands.

Reality doesn't force us to choose between skill or no skill; reality often doesn't give us a choice. You can either make a skill for your company's proprietary system or your model has to figure it out from scratch every time by searching wikis or reading code. If you use them right, skills are a compression mechanism. Instead of your model needing to fetch all of these files dynamically every time, it can simply run the skill statically.

To steel-man the paper: it is worth looking at whether you should try to code something up first or try a skill first. And it may well be valid to say try first and if you can't work it out in 5 mins, install a skill. But there's a meta point of skills as software (where you deduplicate the effort of solving regressions).

For a reductio ad absurdum, if self-generated skills with no additional context _didn't_ eventually level off in performance, then we could reach AGI by making one big skill that keeps growing and solving harder and harder tasks, including improving the capability of its own skill-builder skill, all without embedding any signals from the environment or needing to interface with the real world at all.

st-msl - 7 hours ago

What's interesting to me isn't the self-generated finding (everyone here has correctly identified the methodology issue). It's Table 4 buried on page 6.

Healthcare +51.9pp. Manufacturing +41.9pp. Software Engineering +4.5pp.

The domains where models have the weakest priors from pretraining benefit the most from external procedural knowledge. That's not surprising on its own, but there's an implication I haven't seen anyone raise: these are exactly the enterprise domains where that procedural knowledge is most proprietary and most dangerous to lose between sessions.

The paper's entire architecture is single-player. A SKILL.md sits in a directory, one agent reads it, session ends. When Agent A at a bank figures out the right approach to parsing 13F filings (0% to 75% with the right skill in this paper), that knowledge dies with the context window. Agent B starts from scratch.

We're building shared memory infrastructure for agents at Memco (https://memco.ai) and this paper maps directly to what our enterprise design partners keep telling us — the problem isn't writing skills, it's that procedural knowledge doesn't compound across agents, sessions, or teams. The paper even shows 2-3 focused skills outperform comprehensive docs, which is a retrieval problem masquerading as an authoring problem.

The question this paper should be asking isn't "can agents write their own skills" — it's "what infrastructure makes skills accumulate and transfer?" Static files in a folder is the wrong primitive for that.

rcarmo - 13 hours ago

I only generate skills _after_ I've worked through a problem with the model - usually by asking it "what have you learned in this session?". I have no idea why people would think it can zero-shot a problem space without any guidance or actual experience...

cheema33 - 12 hours ago

> I only generate skills _after_ I've worked through a problem with the model.
This is the correct way the vast majority of the time. There are exceptions: when I know for certain that the models do not have enough training material on a new library, or one that isn't often used, or an internal tool. In those cases I know I will have a struggle on my hands if I don't start out with a skill that teaches the model the basics of what it does not know. I then update the skill with more polish as we discover additional ways it can be improved. Any errors the model makes are used to improve existing skills or create new ones.
myhf - 13 hours ago

Why would you expect it to generate more effective skills when you aren't even making a salt circle or lighting incense?

rriley - 12 hours ago

The biggest gap in this paper is the condition they didn't test: Skills built through human-AI collaboration. They found fully self-generated Skills are useless (-1.3pp) and human-curated ones help a lot (+16.2pp), but that's a false dichotomy. In practice, especially in tools like OpenClaw, skills will emerge iteratively: the AI drafts procedural knowledge while solving a real problem, the human refines it with domain expertise. Neither produces the same artifact alone. The +16.2pp from curated Skills is likely the floor for this approach, not the ceiling. Would love to see a fourth condition.

D-Machine - 7 hours ago

Insofar as they are testing very broken and simple "skills" (as evidenced by comments in this thread), and still find significant net positives in some cases, negatives in others, and huge variability overall, I actually think this is a fun paper providing support that, at least if you count using skills here, you can, in general, very objectively and quantifiably be holding it right/wrong.

I.e. your use of skills resulting in differences of up to ~52 percentage points (or negative percentage points) in improvements (or degradations) in your percentage pass rate is a decent first-pass measure of the importance of skills here.

bee_rider - 12 hours ago

In general terms, we get these kinds of results that seem to indicate that LLMs can’t really “create” new information using inference. LLM generated skills don’t help. Training on content that was generated by LLMs causes models to collapse or something. It seems like it is accepted as really intuitive.

But it seems pretty surprising to me. The training corpus contains so much information and the models operate at the level of… a bright novice. It seems like there obviously ought to be more insights to derive from looking harder at aspects of the corpus.

Why isn’t this considered astonishing?

zozbot234 - 12 hours ago

The training corpus is only learned very approximately and poorly during pretraining. You can use inference-time compute to try and cope, but this can at best make you somewhat more self-consistent; it cannot recreate info that you didn't learn effectively to begin with!

jngiam1 - 13 hours ago

The Skills I have for Claude are all based on personal preferences and reflect the setup I have going. It's a way to narrow the probability space to the specific set which works really well for me.

alexhans - 14 hours ago

Isn't the title editorialised? Probably for clicks?

I think that most of the adoption around Agent Skills would have a focus on ease of use, standardization and context management and not correctness.

My own thoughts on how to approach skill building target people who are adopting LLM development now more than ever, although this was definitely possible before (in a non-standard way) [1]

[1] https://alexhans.github.io/posts/series/evals/building-agent...

Arifcodes - 8 hours ago

The distinction matters: skills that encode your team's domain-specific conventions are useful. Skills the model generates from scratch based on a vague prompt are not.

I've been building AI agent systems for clients and the pattern that works is iterative: the agent tries something, you steer it, then you capture what worked as a reusable skill. Not "generate skills before solving" but "distill lessons after solving." The paper tests the former, which nobody experienced actually does.

The real value of skills is reducing token burn on repeat tasks. Once you've figured out the right approach, you encode it so next time the model doesn't have to re-derive everything from first principles. It's memoization for reasoning.
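As a loose illustration of the memoization analogy, here is a toy sketch. It is not how any agent harness stores skills; `derive_approach` and the task names are made up, and the cache stands in for a saved skill file.

```python
from functools import lru_cache

derivations = 0  # counts how often the expensive path actually runs


@lru_cache(maxsize=None)
def derive_approach(task_kind):
    """Hypothetical stand-in for the model working out an approach from scratch."""
    global derivations
    derivations += 1
    return f"steps for {task_kind}"


derive_approach("parse invoice PDF")  # first time: derived from first principles
derive_approach("parse invoice PDF")  # repeat task: served from the cache
derive_approach("rotate API keys")    # new kind of task: derived again
print(derivations)  # prints 2
```

The analogy only goes so far: a real skill still costs tokens to load and can go stale, which a pure-function cache does not capture.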

bavarianbob - 8 hours ago

Do you have a working example of a skill reducing tokens on repeat tasks? I'm personally seeing the cost of writing and maintaining skills to be much larger than the tokens I'm saving by doing so.

andix - 12 hours ago

I think there are generally 3 kinds of skills:

1. only information and instructions on how to answer

2. some defined actions (run specific cli commands for specific tasks, use this api with those parameters)

3. skills including scripts

1 seems to be of limited use

2 and 3 can save the agent quite some time for finding a solution. And once the agent found a programmatic solution to a specific problem, they can store this information in a skill

lmeyerov - 12 hours ago

We had a measurable shift when we started doing ai-coding loops driven by evals. By definition, the additions make the numbers go up-and-to-the-right. It's the epitome of "you get what you measure" :)

Chaos Congress talk on this from a couple months ago, jump to the coding loops part: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t... . The talk focuses mostly on MCPs, but we now use the same flow for Skills.

This kind of experience makes me more hesitant to take on plugin and skill repos lacking evals or equivalent proving measurable quality over what the LLM knows and harness can handle. Generally a small number of things end up mattering majorly, but they end up being pivotal to get right, and the rest is a death by a thousand cuts.

getoffit - 14 hours ago

"Small models" will always outperform as they are deterministic (or closer to it).

This was realized in 2023 already: https://newsletter.semianalysis.com/p/google-we-have-no-moat...

"Less is best" is not a new realization. The concept exists across contexts. Music described as "overplayed". Prose described as verbose.

We just went through an era of compute that chanted "break down your monoliths". NPM ecosystem being lots of small little packages to compose together. Unix philosophy of small composable utilities is another example.

So models will improve as they are compressed, skeletonized down to opcodes, geometric models to render, including geometry for text as the bytecode patterns for such will provide the simplest model for recreating the most outputs. Compressing out useless semantics from the state of the machine's operations and leaving the user to apply labels at the presentation layer.

nguyentran03 - 12 hours ago

Small models aren't more deterministic than large ones. Determinism comes from temperature and sampling settings, not parameter count. A 7B model at temp 0.7 is just as stochastic as a 405B model.
The "no moat" memo you linked was about open source catching up to closed models through fine-tuning, not about small models outperforming large ones.
I'm also not sure what "skeletonized down to opcodes" or "geometry for text as bytecode patterns" means in the context of neural networks. Model compression is a real field (quantization, distillation, pruning) but none of it works the way you're describing here.
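To make the temperature point concrete, here is a minimal sketch of token sampling. The logits are made-up numbers and `sample_token` is a hypothetical helper, not any library's API. Note that nothing in it depends on how many parameters produced the logits.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick an index from `logits`; temperature 0 means greedy (argmax)."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token, no randomness.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    weights = [math.exp(x - top) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.3]  # made-up scores for three candidate tokens

# Same logits, 50 different random seeds, two temperature settings.
greedy = {sample_token(logits, 0, random.Random(seed)) for seed in range(50)}
sampled = {sample_token(logits, 0.7, random.Random(seed)) for seed in range(50)}

print(greedy)        # a single token index: the choice never varies
print(len(sampled))  # more than one distinct token across seeds
```

Real serving stacks have other sources of run-to-run variation that this toy ignores, but those are not a function of model size either.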
BoredomIsFun - 7 hours ago

> "Small models" will always outperform as they are deterministic (or closer to it).
Your whole comment feels like, pardon me, like LARPing. No, small models do not outperform the large ones, unless finetuned. Saying that as someone who uses small models 95% vs cloud ones.

pizza - 13 hours ago

The more general question of how to evaluate the quality of a given skill file is quite interesting to me. A skill may prime a model's responses in a way that a prompt alone may not. But also models aren't good at judging what they are or are not capable of.

Just asking a model "how good is this skill?" may or may not work. Possibly the next laziest thing you could do, while still being cheap, is to ask the model to make a quiz for itself, have it take the quiz with and without access to the skill, and then see how much the skill improved it. There are still many problems with that approach, but would it work well enough, much of the time, for heuristically estimating the quality of a skill?
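A rough sketch of the scoring half of that idea, assuming you already have per-question pass/fail results from two runs of the same quiz. The result lists below are invented for illustration, and the function names are hypothetical.

```python
def pass_rate(results):
    """Fraction of quiz items answered correctly (results is a list of bools)."""
    return sum(results) / len(results)


def skill_uplift_pp(without_skill, with_skill):
    """Change in pass rate, in percentage points, from adding the skill."""
    return 100 * (pass_rate(with_skill) - pass_rate(without_skill))


# Hypothetical outcomes on the same 8 quiz questions (True = correct).
without_skill = [True, False, False, True, False, False, True, False]
with_skill = [True, True, False, True, True, False, True, True]

print(f"uplift: {skill_uplift_pp(without_skill, with_skill):+.1f}pp")  # uplift: +37.5pp
```

With only a handful of questions the number is noisy, and a model-written quiz may only probe what the model already knows, so this is a heuristic at best.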

JB_5000 - 8 hours ago

Interesting benchmark, but worth noting the methodology: skills are generated before the task, with no feedback loop. In practice, useful skills tend to emerge from doing — you attempt, observe what failed, then codify what worked. Generate → execute → observe → refine. The paper tests cold generation, which is a different (and less realistic) setup.
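The generate → execute → observe → refine loop can be sketched like this. It is a toy under stated assumptions: `attempt` and `revise` are hypothetical stand-ins for an agent run and a human- or agent-written edit, and the `--dry-run` flag is made up.

```python
def refine_skill(skill, attempt, revise, max_rounds=3):
    """Run the task with the current skill and revise it after each failure.

    `attempt(skill)` returns (succeeded, feedback); `revise(skill, feedback)`
    returns an updated skill text.
    """
    for _ in range(max_rounds):
        succeeded, feedback = attempt(skill)
        if succeeded:
            break
        skill = revise(skill, feedback)
    return skill


# Toy stand-ins: the task "passes" once the skill mentions the right flag.
def attempt(skill):
    ok = "--dry-run" in skill
    return ok, None if ok else "command failed: missing --dry-run"


def revise(skill, feedback):
    return skill + "\n- Always pass --dry-run first. (" + feedback + ")"


final = refine_skill("# deploy skill", attempt, revise)
print(final)
```

The cold-generation setup in the paper corresponds to writing the skill once and never calling `attempt` before using it.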

lukev - 13 hours ago

This clarifies an important point for me.

The derivative of an LLM agent's capabilities (on its own) is negative. It's not that they can't do useful work -- it means that (for now) they require some level of input or steering.

If that were to change -- if an agent could consistently get better at what it does without intervention -- that would represent a true paradigm shift. An accelerating curve, rather than one trending back towards linearity.

This represents a necessary inflection point for any sort of AI "takeoff" scenario.

So this study is actually kind of important, even though it's a null result. Because the contra view would be immensely significant.

Exoristos - 13 hours ago

Just to save us all some time and trouble, I'll point out that that's never really going to happen.
- nvader - 13 hours ago
  
  Source?
  - catmanjan - 13 hours ago
    
    It was revealed to me in a dream
jibal - 12 hours ago

Doesn't anyone learn from Malthus? In the real world, accelerating curves inevitably stop accelerating.
Others here have suggested that AIs should be able to self-generate skills by doing web searches. What happens when all of the information from web searches (of knowledge generated by ordinary human intelligence) has been extracted?
On another post (about crackpot Nick Bostrom claiming that an ASI would "imminently" lead to scientific breakthroughs like curing Alzheimer's and so a 3% chance of developing an ASI would be worth a 97% chance of annihilating humanity) I noted that an ASI isn't a genie or magic wand; it can't find the greatest prime or solve the halting problem. Another person noted that an ASI can't figure out how to do a linear search in O(1) time. (We already know how to do a table lookup in amortized O(1) time--build a hash table.) Science is like animal breeding and many other processes ... there's a limit to how much it can be sped up.
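For anyone unsure why the hash table doesn't contradict the linear-search point, a small sketch (the names in the list are arbitrary): the fast lookup only exists after an O(n) pass to build the table, so searching unindexed data is still linear.

```python
def linear_search(items, target):
    """Scan left to right: O(n) comparisons in the worst case."""
    for position, value in enumerate(items):
        if value == target:
            return position
    return -1


def build_index(items):
    """One O(n) pass that maps each value to its first position."""
    index = {}
    for position, value in enumerate(items):
        index.setdefault(value, position)
    return index


names = ["ada", "grace", "linus", "barbara"]
index = build_index(names)  # pay O(n) once, up front

print(linear_search(names, "linus"))  # 2, found by scanning
print(index["linus"])                 # 2, found by an average-case O(1) dict lookup
```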

rrvsh - 14 hours ago

Despite skills being just a new form of memory and context engineering for an agent, I think the framework is still great for agents to self-develop, given a good prompt to regularly review their own sessions and pick learning points to save as skills. In fact, I think the "craft" of prompt engineering has been lost somewhat - I still enjoy puzzling out and iterating over the best possible starting prompt for a conversation to get the best result I can for a one-shot

FWIW I didn't read the paper and am judging it based on its title, which I think is fair because "self-generated agent skills" is a pretty loose definition.

ed_elliott_asc - 14 hours ago

Develop an ai skill to read articles and come up with a HN post for you :)
- verdverm - 14 hours ago
  
  ai;dr (didn't read)

daxfohl - 10 hours ago

I wonder though if self generated skills by advanced models would improve performance of those tasks by small models, or skills created with reasoning mode enabled would help execution of those skills when reasoning is turned off.

For repetitive tasks, that could still be a good way to save on tokens and cost, while still remaining fully automated.

niraj-agarwal - 9 hours ago

https://www.seangoedecke.com/generate-skills-afterwards/ - the response

small_model - 14 hours ago

Skills seem to be a crutch until we get continual learning. Imagine you've been running an instance for 6 months and it still remembers when you told it that it was running on your Linux server over ssh and not on your Mac.

verdverm - 14 hours ago

Search works well for this today, no need for continuous learning
Not even sure how you envision continuous learning, but if you mean model updates, I'm not sure the economics work out
- small_model - 14 hours ago
  
  Actually claude has memory files now so it has some sort of learning, I think it will improve over time and they should survive a model update.
  - verdverm - 14 hours ago
    
    putting stuff in markdown files is not "learning", it's called taking notes, like we've done for 1000s of years
    
    small_model - 14 hours ago
    
    I guess when I was in class and took notes, then reviewed them later I wasn't "learning" anything.
    
    verdverm - 13 hours ago
    
    That later "learning" part is updating weights in your brain
    What AIs get is a cheat sheet for the session
    
    small_model - 13 hours ago
    
    That's what I mean by continual learning, skills, memory are a crutch until real learning can happen, which could be weights changing in the local instance.
    
    verdverm - 13 hours ago
    
    And my point is that weight changes are not likely to have the economic ROI for their justification on a person-by-person basis
    What you are suggesting is a very expensive late-training phase activity. It's also not clear anymore when fine-tuning helps or hurts. Progress is rapid
    
    small_model - 13 hours ago
    
    I see, I misunderstood your original message. Given how much progress has been made without it, it's perhaps not necessary, especially if the economics make it prohibitive.
    
    jibal - 11 hours ago
    
    Reading notes is only necessary because of how lossy human memory is. Reading notes doesn't give you new information, it just reinforces memory paths ... which will fade and you'll have to read the notes again later unless you frequently apply the knowledge, which again reinforces those paths (but lossily, so the bits of information not repeatedly used will fade, and you will again have to read the notes if you need those bits ... or just to re-mind yourself what they were).
    
    ben_w - 14 hours ago
    
    Socrates made a similar complaint about the invention of writing, itself.

turnsout - 15 hours ago

It seems intuitive that a naive self-generated Skill would be low-value, since the model already knows whatever it's telling itself.

However, I've found them to be useful for capturing instructions on how to use other tools (e.g. hints on how to use command-line tools or APIs). I treat them like mini CLAUDE.mds that are specific only to certain workflows.

When Claude isn't able to use a Skill well, I ask it to reflect on why, and update the Skill to clarify, adding or removing detail as necessary.

With these Skills in place, the agent is able to do things it would really struggle with otherwise, having to consume a lot of tokens failing to use the tools and looking up documentation, etc.

evmaki - 14 hours ago

> I ask it to reflect on why, and update the Skill to clarify, adding or removing detail as necessary.
We are probably undervaluing the human part of the feedback loop in this discussion. Claude is able to solve the problem given the appropriate human feedback — many then jump to the conclusion that well, if Claude is capable of doing it under some circumstances, we just need to figure out how to remove the human part so that Claude can eventually figure it out itself.
Humans are still serving a very crucial role in disambiguation, and in centering the most salient information. We do this based on our situational context, which comes from hands-on knowledge of the problem space. I'm hesitant to assume that because Claude CAN bootstrap skills (which is damn impressive!), it would somehow eventually do so entirely on its own, devoid of any situational context beyond a natural language spec.
- turnsout - 12 hours ago
  
  Absolutely. This is why I'm hesitant to go full "dark software factory" and try to build agent loops that iterate in YOLO mode without my input. I spent a day last week iterating Skills on a project by giving it the same high-level task, pausing it when it went off the rails, and having it self-reflect and update its Skill. It almost took me out of the loop, but I still had to be there to clear up some misunderstandings and apply some common sense and judgment.
YZF - 14 hours ago

A pattern I use a lot is after working with the LLM on a problem, directing it, providing additional context and information, ask it to summarize its learning into a skill. Then the next session that has a similar theme can start with that knowledge.
rrvsh - 14 hours ago

+1, I said as much here: https://news.ycombinator.com/item?id=47040811

ineedasername - 12 hours ago

The title should be changed to reflect the actual title, as the user-provided one is incorrect and misstates a central conclusion.

rapind - 11 hours ago

Breaking news: Developers who yak shave their vim configs also get carried away with their LLM setups.

realaaa - 8 hours ago

Interesting study! However, yes, without real-life application and internet access for the model, it is a bit of an n-dimensional-horse-in-a-vacuum kind of study.

daxfohl - 11 hours ago

I'm kinda surprised by this. Yeah it's just regurgitating something it already "knows", but I'd still expect that having the skill materialized there in the context would give it something concrete to reference, less likelihood of getting lost or hallucinating, and probably need less incremental context to do the job.

I mean, basically it's doing the same thing as reasoning IIUC, except up-front rather than inline and ad-hoc, so I'd almost expect it to work even better than reasoning alone.

daxfohl - 10 hours ago

Actually, anthropomorphizing a bit, if I take something I vaguely remember how to do, say integration by parts, then if I turn off my own brain's "reasoning", then the "skill" I would generate would almost certainly be wrong, and no help in solving a problem. But if I turn reasoning on, I'd probably be able to come up with the correct algorithm and write it as a skill, sure, but if reasoning is on, I'd be able to solve the problem without needing to write down the algorithm itself (and might even be more successful that way, with something concrete to work with).
OTOH something I know innately how to do, like long division, writing down the algorithm doesn't help at all. In fact if someone just gave me that algorithm and for whatever reason I didn't recognize what it was, I'd have a lot harder time following the instructions than just innately dividing the numbers.
Of course anthropomorphizing is always dangerous, but it does provide potential reasons why my above rationale could be wrong.

ryanthedev - 12 hours ago

Love the article and happy to have a framework but I don’t think those are good SWE skills.

I imagine something more like https://github.com/ryanthedev/code-foundations

Based on an actual software book.

verdverm - 14 hours ago

Anecdotal middle ground, I have used LLM automation to generate AGENTS.md files at scale across a repo

1. You MUST review and correct them

2. Embrace minimalism, they are spark notes and an index, not comprehensive

3. Force them into context

I imagine similar concepts hold for skills

scotty79 - 12 hours ago

I think self-generation of skills might be useful if it's based on model doing websearches, experiments in a sandboxed environment and putting into skill what it found out.

Also generating skills using top of the line model to keep using them later in cheap open weights model seems like a good use of resources.

Online sharing of skills generated in such manner also seems like a wonderful idea.

j45 - 14 hours ago

I am lucky to count friends who are academics engaged in research, and one topic of discussion I notice around AI is researchers with a non-tech background and/or a lack of implementation / operationalization / commercialization in applying technology to Business, which can also cloud these kinds of results.

I have systemized and automated businesses for a long time before LLMs came out, which generally wasn't very popular.

It is really weird to see everyone get excited about this kind of automation and then try to jump to the end points with something that's non-deterministic and wonder why it doesn't work like every other computer they've used (all or none).

Agents can self generate skills, maybe not effortlessly, or with psychic skills of reading between the lines (special exception for Claude), it's also about the framework and scaffolding in which to create skills that work, and what can be brought back to the "self-generation".

Without experience in creating computer skills in general, attempting to self-generate agent skills is kind of like trying to use AI to autocomplete a sentence and then not liking how it went. To a fair degree it can be lined up to improve considerably.

Right now there seems to be a 6-12 month lag between studies like these and it being shared/reported in the wild.

Too often, they are researching something reported in the wild and trying to study it, and it very well may work for some cases, but not all cases, and the research kind of entirely misses it.

With AI, it's incredibly important to follow show and not tell.

Sharing this from genuine curiosity if this resonates with anyone, and if so, how/where.

sebastianconcpt - 12 hours ago

Am I the only one surprised about anyone's need for a study to conclude that?

jibal - 12 hours ago

https://news.ycombinator.com/newsguidelines.html
> Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.