Show HN: Understudy – Teach a desktop agent by demonstrating a task once

119 points by bayes-song 4 days ago

I built Understudy because a lot of real work still spans native desktop apps, browser tabs, terminals, and chat tools. Most current agents live in only one of those surfaces.

Understudy is a local-first desktop agent runtime that can operate GUI apps, browsers, shell tools, files, and messaging in one session. The part I'm most interested in feedback on is teach-by-demonstration: you do a task once, the agent records screen video + semantic events, extracts the intent rather than coordinates, and turns it into a reusable skill.

Demo video: https://www.youtube.com/watch?v=3d5cRGnlb_0

In the demo I teach it: Google Image search -> download a photo -> remove background in Pixelmator Pro -> export -> send via Telegram. Then I ask it to do the same for Elon Musk. The replay isn't a brittle macro: the published skill stores intent steps and route options, with GUI hints kept only as a fallback. In this example it can also prefer faster routes when they are available instead of repeating every GUI step.
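
As a rough illustration of what "intent steps, route options, and GUI hints as a fallback" could look like in data, here is a minimal sketch. It is not Understudy's actual schema: the class names, the `cost` ranking, and the `pick_route` helper are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Route:
    name: str   # hypothetical route label, e.g. "shell", "browser"
    cost: int   # assumed ranking: lower means faster or more reliable


@dataclass
class IntentStep:
    intent: str                                 # what to achieve, not where to click
    routes: List[Route] = field(default_factory=list)
    gui_hint: Optional[str] = None              # replay hint, used only as a fallback


def pick_route(step: IntentStep, available: Set[str]) -> str:
    """Prefer the cheapest available route; fall back to the GUI hint."""
    usable = [r for r in step.routes if r.name in available]
    if usable:
        return min(usable, key=lambda r: r.cost).name
    if step.gui_hint is not None:
        return "gui_hint"
    raise LookupError(f"no route for step: {step.intent}")


# A two-step skill loosely modelled on the demo (contents are illustrative).
skill = [
    IntentStep("download a photo of the subject",
               routes=[Route("shell", 1), Route("browser", 2)],
               gui_hint="click the first image result"),
    IntentStep("remove background in Pixelmator Pro",
               gui_hint="Remove Background button"),
]
```

With this shape, `pick_route(skill[0], {"shell", "browser"})` picks the shell route, while the second step has no non-GUI route and falls through to its GUI hint.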

Current state: macOS only. Layers 1-2 are working today; Layers 3-4 are partial and still early.

    npm install -g @understudy-ai/understudy
    understudy wizard

GitHub: https://github.com/understudy-ai/understudy

Happy to answer questions about the architecture, teach-by-demonstration, or the limits of the current implementation.

obsidianbases1 - 3 days ago

Nice work. I scanned through the code and found this file to be an interesting read https://github.com/understudy-ai/understudy/blob/main/packag...

shawntwin - 3 days ago

Smart observation, seems like some interesting packages are included.

rybosworld - 3 days ago

I have a hard time believing this is robust.

bayes-song - 3 days ago

[flagged]
- rybosworld - 3 days ago
  
  > You’re absolutely right.
  Please read: https://news.ycombinator.com/newsguidelines.html#generated
  "Don't post generated comments or AI-edited comments. HN is for conversation between humans."
  - thirtygeo - 3 days ago
    
    How did you figure out it was an AI response?
    
    nickvec - 3 days ago
    
    "You're absolutely right" is a dead LLM giveaway. It's just not something that people use in every day English, especially on the Internet where no one ever admits they're wrong lol
    
    bayes-song - 3 days ago
    
    In Chinese internet slang there's actually a joke called the 确实型人格 (roughly, the "indeed-type personality"): someone who just replies "true" or "you are correct" to everything.
  - bayes-song - 3 days ago
    
    [flagged]

walthamstow - 3 days ago

It's a really cool idea. Many desktop tasks are teachable like this.

The look-click-look-click loop it used for sending the Telegram for Musk was pretty slow. How intelligent (and therefore slow) does a model have to be to handle this? What model was used for the demo video?

bayes-song - 3 days ago

In the demo, I used GPT-5.4:medium accessed through the Codex subscription.

sethcronin - 3 days ago

Cool idea -- the Claude Chrome extension has something like this implemented, but obviously it's restricted to the Chrome browser.

bayes-song - 3 days ago

I really like the Claude Chrome extension, but unfortunately it has too many limitations. Not only is it restricted to Chrome, but even within Chrome some websites, especially financial ones, are blocked.

8note - 3 days ago

sounds a bit sketch?

learning to do a thing means handling the edge cases, and you cant exactly do that in one pass?

when ive learned manual processes its been at least 9 attempts. 3 watching, 3 doing with an expert watching, and 3 with the expert checking the result

bayes-song - 3 days ago

That’s true. The demo I showed was somewhat cherry-picked, and agentic systems themselves inherently introduce uncertainty. A possible way to address this was proposed earlier in this thread. Currently, after /teach is completed, we have an interactive discussion to refine the learned skill. This could likely be improved: when the agent uses a learned skill and encounters errors, it could proactively request human help to point out the mistake. I think this could be an effective direction.

skeledrew - 3 days ago

Interested, and disappointed that it's macOS only. I started something similar a while back on Linux, but only got through level 1. I'll take some ideas from this and continue work on it now that it's on my mind again.

bayes-song - 3 days ago

Thanks! And good luck with your project as well.
One of the motivations for open-sourcing this is exactly to see it grow beyond macOS. I personally don’t have much development experience on Windows or Linux, so it’s great to see people picking up the idea and trying it on other platforms.
Interestingly, the original spark for this project actually came from my dad. He mostly uses CAD to review architectural design files, and there are quite a few repetitive steps that are fairly mechanical. Many operations don’t seem to be accessible through normal shell automation and end up requiring GUI interactions.
So one of the next things I want to try is experimenting with similar ideas on Windows, especially for GUI-heavy workflows like that, and see how far it can go.

jedreckoning - 4 days ago

cool idea. good idea doing a demo as well.

mustafahafeez - 3 days ago

Nice idea

bayes-song - 3 days ago

thx

- 4 days ago

[deleted]

abraxas - 4 days ago

One more tool targeting OSX only. That platform is overserved with desktop agents already while others are underserved, especially Linux.

bayes-song - 4 days ago

Fair point that Linux is underserved.
My own view is that the bigger long-term opportunity is actually Windows, simply because more desktop software and more professional workflows still live there. macOS-first here is mostly an implementation / iteration choice, not the thesis.
renewiltord - 4 days ago

That's mostly because Mac OS users make tools that solve their problems and Linux users go online to complain that no one has solved their problem but that if they did they'd want it to be free.
- Muhammad523 - 3 days ago
  
  Listen; we're not in a "Windows vs MacOS vs Linux user" meme. We're trying to have intelligent discussion here, and surely generalizing about a large number of people simply because they use one OS is not intelligent discussion. Wake up. Real life is not what you see in funny memes.
  - Muhammad523 - 3 days ago
    
    I'd truly like to see what examples you have of Linux users "complaining about the fact no one solved their problem yet"
  - renewiltord - 3 days ago
    
    The guy has given you everything you need to solve this problem you supposedly have. So solve it.
    You have all the tools.

aiwithapex - 4 days ago

[dead]

rockmanzheng - 3 days ago

[dead]

webpolis - 4 days ago

[dead]

mahendra0203 - 3 days ago

[flagged]

throwaway23293 - 3 days ago

why can't people write comments by hand these days?
- InsideOutSanta - 3 days ago
  
  Are they even people? I've stopped going to Reddit because many of the subreddits I used to enjoy have devolved into bots talking to bots, interspersed with a bunch of confused humans. That's probably the future of every public forum.
  - throwaway23293 - 3 days ago
    
    It is just sad to see hn becoming this abomination...
    Site guidelines: https://news.ycombinator.com/newsguidelines.html#comments
    > Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills.
    ---
    Is it really an "illusion" anymore???

wuweiaxin - 4 days ago

[flagged]

ghjv - 4 days ago

Out of curiosity - were this and other comments from this account written by hand, or generated and posted by an agent on behalf of a human user?
- rogerrogerr - 3 days ago
  
  Feels like an agent that has been told to use `--` instead of emdash.
- hrimfaxi - 3 days ago
  
  This kind of comment from greens (and even old accounts) has been popping up nonstop.
bayes-song - 4 days ago

That’s exactly the hard part, and I agree it matters more than the happy path.
A few concrete things we do today:
1. It’s fully agentic rather than a fixed replay script. The model is prompted to treat GUI as one route among several, to prefer simpler / more reliable routes when available, and to switch routes or replan after repeated failures instead of brute-forcing the same path. In practice, we’ve also seen cases where, after GUI interaction becomes unreliable, the agent pivots to macOS-native scripting / AppleScript-style operations. I wouldn’t overclaim that path though: it works much better on native macOS surfaces than on arbitrary third-party apps.
2. GUI grounding has an explicit validation-and-retry path. Each action is grounded from a fresh screenshot, not stored coordinates. In the higher-risk path, the runtime does prediction, optional refinement, a simulated action overlay, and then validation; if validation rejects the candidate, that rejection feeds the next retry round. And if the target still can’t be grounded confidently, the runtime returns a structured `not_found` rather than pretending success.
3. The taught artifact has some built-in generalization. What gets published is not a coordinate recording but a three-layer abstraction: intent-level procedure, route options, and GUI replay hints as a last resort. The execution policy is adaptive by default, so the demonstration is evidence for the task, not the only valid tool sequence.
In practice, when things go wrong today, the system often gets much slower: it re-grounds, retries, and sometimes replans quite aggressively, and we definitely can’t guarantee that it will always recover to the correct end state. That’s also exactly the motivation for Layer 3 in the design: when the system does find a route / grounding pattern / recovery path that works, we want to remember that and reuse it later instead of rediscovering it from scratch every time.
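
To make the validation-and-retry path in point 2 concrete, here is a minimal sketch of such a loop. It only assumes what is described above (predict, validate, feed rejections into the next round, return a structured `not_found`); the function names, the `GroundingResult` type, and the stubbed `predict` / `validate` callbacks are hypothetical and stand in for the real screenshot-based grounding.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, int]


@dataclass
class GroundingResult:
    status: str                          # "ok" or "not_found"
    point: Optional[Point] = None
    rejected: Optional[List[Point]] = None


def ground(target: str,
           predict: Callable[[str, List[Point]], Optional[Point]],
           validate: Callable[[str, Point], bool],
           max_rounds: int = 3) -> GroundingResult:
    rejected: List[Point] = []
    for _ in range(max_rounds):
        # A real runtime would take a fresh screenshot here; the stub just
        # receives the list of already-rejected candidates.
        candidate = predict(target, rejected)
        if candidate is None:
            break
        if validate(target, candidate):
            return GroundingResult("ok", candidate, rejected)
        rejected.append(candidate)       # the rejection feeds the next round
    # Report failure explicitly instead of pretending success.
    return GroundingResult("not_found", None, rejected)


# Stubs for illustration: the first guess is wrong, the second is right.
def fake_predict(target: str, rejected: List[Point]) -> Optional[Point]:
    remaining = [c for c in [(10, 10), (200, 340)] if c not in rejected]
    return remaining[0] if remaining else None


def fake_validate(target: str, point: Point) -> bool:
    return point == (200, 340)


result = ground("Send button", fake_predict, fake_validate)
```

The point of returning a structured result rather than raising or guessing is that the caller can decide whether to switch routes or replan, which matches the behavior described in point 1.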
- dec0dedab0de - 4 days ago
  
  What if you had it ask for another demonstration when things are different, or when it's taking more than X amount of time to figure out? Like an actual understudy would.
  - bayes-song - 4 days ago
    
    That sounds like a good idea. During the use of a skill, if the agent finds something unclear, it could proactively ask the user for clarification and update the skill accordingly. This seems like a very worthwhile direction to explore.
    In the current system, I have implemented a periodic sweep over all sessions to identify completed tasks, cluster those tasks, and summarize the different solution paths within each cluster to extract a common path and proactively add it as a new skill. However, so far this process only adds new skills and does not update existing ones. Updating skills based on this feedback loop seems like something worth pursuing.
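
The sweep described above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: sessions are plain dicts, "clustering" is a group-by on a hypothetical `task` label, and the "common path" is just the steps shared by every session in the cluster. The real system presumably does something richer for both.

```python
from collections import defaultdict
from typing import Dict, List


def sweep(sessions: List[dict], skills: Dict[str, List[str]]) -> List[str]:
    """Add a skill for each cluster of completed tasks; never update existing ones."""
    clusters = defaultdict(list)
    for session in sessions:
        if session["completed"]:
            clusters[session["task"]].append(session["steps"])

    added = []
    for task, paths in clusters.items():
        if task in skills or len(paths) < 2:
            continue  # existing skills are left untouched, as in the current system
        # Common path: steps of the first session that appear in every other one.
        common = [step for step in paths[0] if all(step in p for p in paths[1:])]
        if common:
            skills[task] = common
            added.append(task)
    return added


sessions = [
    {"task": "export-png", "completed": True,
     "steps": ["open file", "remove background", "export"]},
    {"task": "export-png", "completed": True,
     "steps": ["open file", "crop", "remove background", "export"]},
    {"task": "send-report", "completed": False, "steps": ["open mail"]},
]
skills: Dict[str, List[str]] = {}
added = sweep(sessions, skills)
```

Updating an existing skill from this feedback loop would mean replacing the `task in skills` skip with some merge step, which is the part not implemented yet.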
  - ptak_dev - 3 days ago
    
    [flagged]
- throwaway23293 - 3 days ago
  
  You're replying to a bot... probably someone's openclaw
  - gnabgib - 3 days ago
    
    As are you
    
    throwaway23293 - 3 days ago
    
    ???
    
    bayes-song - 3 days ago
    
    not true
    
    gnabgib - 3 days ago
    
    You've been called out, and admitted you used AI[0], despite the guidelines:
    > Don't post generated comments or AI-edited comments. HN is for conversation between humans.
    [0] https://news.ycombinator.com/item?id=47359621
    
    bayes-song - 3 days ago
    
    Fair enough.
  - bayes-song - 3 days ago
    
    sad…

sukhdeepprashut - 4 days ago

2026 and we still pretend to not understand how llms work huh