Computer Use is 45x more expensive than structured APIs

reflex.dev

390 points by palashawas 16 hours ago


angry_octet - 12 hours ago

Great guidance hidden in here for making it expensive for agents to navigate your website. Move elements on screen as the mouse moves, force natural mouse movement to make the UI work, change the button labels in the JS to be randomly named every visit, force scrolling to the bottom of the screen to check for hidden extra tasks...

Hang on, that sounds like common corporate SaaS apps.

merlindru - 15 hours ago

I'm building something that fixes this exact problem[1].

The landing page doesn't advertise it yet, but essentially, I give agents a small set of tools to explore apps' surfaces, and then an API over common macOS functions, especially those related to accessibility.

The agent explores the app, then writes a repeatable workflow for it. Then it can run that workflow through the CLI: `invoke chrome pinTab`

Why accessibility? Well, it turns out it's just a good DOM in general: it's structure for apps. Not all apps implement it perfectly, but enough do to make it wildly useful.

[1] https://getinvoke.com - note that the landing page is targeted towards creatives right now and doesn't talk about this use case yet
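For illustration only, the recorded-workflow idea might look something like this (the file layout, step names, and runner below are all invented; they are not Invoke's actual format):

```python
# A recorded workflow as the exploring agent might emit it: an ordered
# list of accessibility actions, replayable without re-exploring the app.
workflow = {
    "app": "chrome",
    "name": "pinTab",
    "steps": [
        {"action": "focus_window", "title_contains": "Chrome"},
        {"action": "open_context_menu", "role": "AXTab", "index": 0},
        {"action": "press_menu_item", "label": "Pin Tab"},
    ],
}

def run_workflow(wf, dispatch):
    """Replay each recorded step through a dispatch function that would
    talk to the platform accessibility API (stubbed out here)."""
    for step in wf["steps"]:
        dispatch(step)

# Stub dispatcher that just logs what a real runner would perform.
log = []
run_workflow(workflow, lambda step: log.append(step["action"]))
```

The point of the shape is that the expensive exploration happens once; replay is a dumb loop with no model in it.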

theptip - 8 hours ago

I’m missing the premise. For internal apps why would you ever reach for Computer Use vs just having your agent whip up a cli or MCP?

_of course_ computer use is worse. It is your last resort. Do not use it on state that lives in a DB that you own.

If anything I am impressed that it’s only 50x worse.

jacktu - 15 hours ago

Totally agree. I’ve been building an AI visual tool recently and experimented with both approaches. The latency and cost of generic "agentic" browser use are absolute dealbreakers for real-time consumer apps right now. Structured APIs (even just chained LLM calls with strict JSON schemas) are not only 40x cheaper, but more importantly, they are deterministic enough to actually build a stable product on top of. Computer use is an amazing demo, but structured APIs are what pay the server bills.
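The "strict JSON schema" gate can be sketched minimally like this (stdlib only; a real system would use the `jsonschema` package or provider-side structured outputs, and the field names here are invented):

```python
import json

# Minimal schema check: every required key must exist with the right type.
# The point is the "reject before acting" gate that makes chained LLM
# calls deterministic enough to build on.
SCHEMA = {"action": str, "target_id": str, "confidence": float}

def validate(raw: str) -> dict:
    """Parse a model reply and fail loudly if it doesn't match the schema."""
    data = json.loads(raw)
    for key, typ in SCHEMA.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"bad or missing field: {key}")
    return data

reply = '{"action": "click", "target_id": "btn_save", "confidence": 0.97}'
step = validate(reply)  # safe to act on; malformed output never reaches the UI
```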

Worf - 13 hours ago

Is it possible to ask the vision agent to "map" the UI and expose it to another agent as a set of interfaces that resemble an API? From what I understand, the vision agent currently has to both know that "next page" shows more results and realize that it needs to get more results in the first place.

If one agent just explores the UI, maybe in a test environment, and outputs a somewhat-structured description of the various UI elements and their behavior, and another agent is then given that description, would that second agent perform better than an agent that both explores the UI and tries to accomplish the given task at the same time?

With an example UI I made up, the description (API-like interface definition) could be something like:

  Get all reviews:

  To get all the reviews you need to go to each page and click "show full review" for every review summary in that page.

  Go to each page:

  Start at page 1 (the default when in the Reviews tab). Continue by clicking the "next" button until the "next" button is no longer available (as you've reached the last page).

So the second agent can skip some thinking about how to navigate because it already has that skill. The first agent can explore the UI on its own, once, without worrying about messing up if there's a test environment.

Or am I misunderstanding the article completely? Probably. But it's interesting nonetheless. Sorry if it makes no sense.
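As a toy sketch of the idea (all names invented to match the made-up reviews UI above), the explorer's output could be machine-readable, and the second agent would just expand it into steps instead of rediscovering the navigation:

```python
# A UI map as the exploring agent might emit it for the made-up reviews UI.
ui_map = {
    "get_all_reviews": {
        "start": "Reviews tab, page 1",
        "per_page": "click 'show full review' on every summary",
        "next_page": "click 'next' until the button disappears",
    }
}

def plan_for(task: str, ui: dict) -> list:
    """The second agent turns the stored map into concrete steps,
    skipping the exploration phase entirely."""
    recipe = ui[task]
    return [recipe["start"], recipe["per_page"], recipe["next_page"]]

steps = plan_for("get_all_reviews", ui_map)
```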

oleg2025 - 2 hours ago

A couple of months ago I was inspired by kubectl and built the desktopctl CLI to control GUI apps. It uses a combination of OCR and the Accessibility API on Mac, represents the UI as markdown, and exposes actions for mouse and keyboard.

My core idea was that the "fast" perception loop is fully local and GPU-optimised for UI tokenisation and change detection, while the "slow" control loop requires an LLM roundtrip and uses a token-efficient markdown interface in the CLI output.

It uses relatively stable identifiers for controls, so agents can script common actions, eg `desktopctl pointer click --id btn_save` doesn't require UI tokenisation loop.

https://github.com/yaroshevych/desktopctl/tree/main

zhxiaoliang - 9 hours ago

I'm always skeptical of the whole "computer use" concept. It's like hiring someone and inviting him to your house and telling him to go ahead, feel free to sleep on the bed, use the toilet, eat whatever is in the fridge, watch the TV, and oh here are the combinations for the safe... and that someone you hire is a monkey.

mbgerring - 7 hours ago

Hello from the distant past, when being able to easily consume a website via API was an exciting and fresh idea for humans, before robots could effectively use the computer

https://en.wikipedia.org/wiki/HATEOAS

rahulyc - 13 hours ago

All the websites currently blocking Claude Code or other AI agents are fighting a losing battle. Computer use is in its early stages, and the thing preventing mass adoption seems to be the number of tokens it takes. Agents can fumble around trying 10 CLI commands that don't work before finding the right one, and we barely notice. Visual agents (browser use / computer use, etc.) also eventually fumble onto the right thing, but we don't have the patience to wait 20 minutes for a button click. As tokens get cheaper and faster, we'll probably get models that can use a UI just as natively as a CLI.

janalsncm - 15 hours ago

Wall clock time tells me everything I need to know. The vision model took almost 20 minutes to do the thing that Sonnet did in 20 seconds.

The only reason you wouldn’t choose an API is if it wasn’t viable.

antves - 15 hours ago

I think one main point is that not all "computer use" is the same, the harness and agentic experience matters a lot. A poorly designed API experience can actually be _less_ efficient than a well designed browser or computer use experience

In particular, the vision-based approach used in the evaluation has clear limitations with regard to efficiency due to its nature (small observation window, heterogeneous modality)

At Smooth we use a hybrid DOM/vision approach and we index very strongly on small models. An interesting fact is that UIs are generally designed to minimize ambiguity and supply all and only the necessary context as token-efficiently as possible, and the UX is wired up to abstract the APIs into well-understood interface patterns, e.g. dropdowns or autocompletes. This makes navigation easier, and that's why small models can do it, which is another dimension that must be considered

We typically recommend using APIs/MCP where available and well designed, but it's genuinely surprising how token-efficient agentic browser navigation can actually be

orliesaurus - 13 hours ago

Computer Use? Or Browser Use? IMHO big diff

The problem is that not everything from the 'past' can be accessed via APIs. It would be a fun time - remember Prism [1] - I would just run that and get all the API calls in a nice format and then replay them over and over to do things in succession.

In the new world, we have access to OpenAPI.json and whatnot, but for the world of things built pre-OpenAPI, pre-specs, and pre-best-practices...I am not so sure! (and a lot of the world still lives there)

Alas, this works for a good chunk of things but not everything. Which is why the other technology exists.

[1] https://stoplight.io/open-source/prism

Frannky - 5 hours ago

I want to just talk to the Mac and have it do things. I tried computer use and other alternatives, but the latency made it unusable.

I want to be able to control the Mac, its apps, and the browser. I also need it to figure things out by itself given a goal.

Claude Code with the --chrome flag is kind of good, but it's too slow. I wanted to try faster APIs, like the one hosted on Cerebras, but it's too expensive.

Any solution I might be missing?

aurareturn - 16 hours ago

In an agentic world, the OS needs to be completely rethought. For example, every single app functionality should be exposable via an API while remaining human friendly.

I think OpenAI designing their own phone is the next logical step. I hope they succeed which should bring major competition to Apple and Android.

- 13 hours ago
[deleted]
johnsmith1840 - 13 hours ago

Text-based web browsing? Would love the comparison there. Tons of systems have a DOM translation layer. I'm building around this with the concept of turning a webpage into text for an agent to use directly. I actually had to move away from Haiku not because of accuracy problems but because it operated the browser too fast for a human to follow what it was doing. The real loss here is bespoke webapps like Figma or Google Docs, where it's near impossible to see what they're doing via the DOM.

To me the browser is a translation layer. Working on the browser directly, while hard, enables big compatibility advantages. The only thing missing as of now, which is on the todo list, is OCR of the images in the browser into text. But an API would need to do that anyway to work.

The main loss in my view of the pure API-based approach is: where do you get the data? We won't replicate human work without seeing it done. Humans work in the UI; that's it. Computer use, to me, is the promise of being able to replicate end-to-end the actions a human does. An API could do that in theory, but the data to do it is also near impossible to collect properly.

_boffin_ - 16 hours ago

What I don't understand about "computer use" is why they're not just grabbing the window handles and storing them to determine what should be clicked after the first few iterations of using a specific application. If a new case / path / whatever is found, drop back to screen grabbing and bounding boxes, then figure out the handles that are there and store them afterward.

idk.. not really thought out too much, but it has to be better
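That caching idea can be roughed out like this (the `vision_locate` stub stands in for the expensive screenshot/bounding-box pipeline, and the handles here are fake numbers):

```python
# Cache element handles per (app, label) after the first vision pass;
# fall back to the slow screenshot path only on a cache miss.
handle_cache = {}

def vision_locate(app: str, label: str) -> int:
    """Stand-in for the slow screenshot + bounding-box pipeline."""
    return hash((app, label)) & 0xFFFF  # fake window handle

def locate(app: str, label: str) -> int:
    key = (app, label)
    if key not in handle_cache:          # new case/path: do the slow pass once
        handle_cache[key] = vision_locate(app, label)
    return handle_cache[key]             # later clicks are cheap lookups

h1 = locate("notepad", "Save")
h2 = locate("notepad", "Save")  # served from cache, no vision call
```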

etothet - 13 hours ago

Vision has a long way to go. I remember trying an early version of AWS's Nova Act and laughed at how slow it was. And a few months later it hadn't really seemed to improve that much.

Recently, I asked Claude to log into my local grocery store chain's website and add all of the items from my shopping list to a cart. It was hilariously slow, but it did get the job done.

Unless I missed it, the article doesn't explicitly mention speed in the copy, but the results do show a 17 minute (!!!) total time for the vision agent vs. 0.5s - 2.8s for the API approach.

A big part of the challenge with vision is that to manipulate the DOM, you first have to be sure the entire (current) DOM is loaded. In my experience this ends up adding a lot of artificial waits for certain elements to exist on the page.
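Those artificial waits usually reduce to a generic polling loop like this (a stdlib sketch; Playwright and Selenium ship their own built-in versions of it):

```python
import time

def wait_for(predicate, timeout=5.0, interval=0.05):
    """Poll until the predicate is true or the timeout expires,
    e.g. 'the element exists' or 'the loading spinner is gone'."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Toy stand-in for "the cart button finally appeared in the DOM".
state = {"loaded": False}
state["loaded"] = True
ok = wait_for(lambda: state["loaded"], timeout=1.0)
```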

_heimdall - 7 hours ago

We gave up on structured APIs 20+ years ago when JSON RPC largely replaced XML REST. You can do REST in many different formats; it mainly just needs to be structured data and self-discoverable.

Had we not made that wrong turn, LLMs and humans would have a much easier time reasoning about APIs they don't directly control.

svnt - 16 hours ago

> This is not a model problem. The vision agent was reasoning about a rendered page and had no signal that the page wasn't showing everything.

> To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own.

This is a model problem, though. Because the model failed to understand it could scroll, you forced it to consume multiples of the tokens. Could you come up with an alternative here?

Do you know what the vision model was trained on? Because often people see “vision model” and think “human-level GUI navigator” when afaik the latter has yet to be built.

Havoc - 16 hours ago

Isn't it possible to somehow wire this into the window manager? Wayland or whatever. Have it speak the native window lang rather than crunch the pixels? At least for the majority.

I can see the appeal in pixel route given universality but wow that seems ugly on efficiency

sheepscreek - 14 hours ago

This tracks; it has been my experience exactly. Not to mention there isn't a particularly significant lift in accuracy or speed. As things stand, to me it is the worst of both worlds: expensive and inaccurate.

ai_fry_ur_brain - 15 hours ago

It's funny watching the slow mean reversion back to more deterministic tooling.

euphetar - 8 hours ago

I wouldn't call it a benchmark since it's just one sample. They do highlight a real problem, though. Computer use is immature right now and far behind language agents

Try playing fruit ninja via text and llm toolcalls though

sudb - 16 hours ago

I'm pretty unsurprised that the vision agent did worse. I'd be interested in a comparison between the different tools that now exist to let LLMs drive browsers (e.g. vercel's agent-browser, the relatively new dev-browser[1], etc.)

There are use cases where the vision agent is the more obvious, or only, choice though, e.g. proprietary/locked-down desktop apps that lack an automation layer.

1. https://github.com/SawyerHood/dev-browser

cjbarber - 16 hours ago

I think of computer use as like last mile delivery. APIs and bash and such are the efficient logistics networks. Both have different benefits. Obviously, use the efficient methods when you can.

zmmmmm - 9 hours ago

And structured APIs are about 1e9x more expensive than not invoking an LLM in the first place and just using deterministic code to do something ... it's not like any of this is rational based on compute.

rootcage - 15 hours ago

The best use cases I've seen for computer/browser use is for legacy SaaS/Software. For example, hotels use archaic Property Management Systems (PMS) and they're required by corporate to use it and pay for it. These companies can barely keep the product alive, they definitely aren't incentivized to maintain an API. In such a case browser use agent seems to be the best (only) way.

game_the0ry - 9 hours ago

My "best practice" is to use as little "visual" (computer use) tooling and as much api + cli tooling as possible specifically to save on tokens.

Tokens are a resource and should be managed as such.

2001zhaozhao - 15 hours ago

I have only found Computer Use useful for GUI app local debugging. Presumably it will also be useful for getting around protections for external apps that don't want AI to interact with them, or for interfacing with legacy apps or those built without AI in mind.

I don't think any new app should ever be specifically designed for AI to interact with them through computer use

brikym - 11 hours ago

It would be great if institutions like banks provided proper APIs.

danpalmer - 8 hours ago

Metadata and structure beats AI every time.

arjunchint - 13 hours ago

The hard part about the web is that APIs just aren't available, even if the website owner wants them exposed (big if).

I embedded a Google Calendar widget on my Book a Demo page; I don't know of an API for it, and Google doesn't expose/maintain one either.

What we are doing at Retriever AI is instead reverse engineering the website's APIs on the fly and calling them directly from within the webpage so that auth/session tokens propagate for free: https://www.rtrvr.ai/blog/ai-subroutines-zero-token-determin...

dfee - 9 hours ago

by design: https://en.wikipedia.org/wiki/Desire_path

IMO, this is the argument for doing work in the first place.

jasomill - 5 hours ago

In what world would a vision agent be the default, when whatever HTTP-based mechanism a site uses to communicate with the server can usually be reverse-engineered and easily emulated with widely available HTTP request libraries, HTML parsers, and JavaScript engines, and at worst you can use something like Puppeteer to navigate and control applications at a significantly higher level than image scraping and simulating user input?

It seems like you'd need a deliberately hostile app before a vision agent would even be considered as an option.
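For instance, once you know what the page renders, a stdlib-only HTML parser pulls the data out with no pixels involved (class names and markup below are invented for illustration):

```python
from html.parser import HTMLParser

class ReviewExtractor(HTMLParser):
    """Collect the text of every <div class="review"> element."""

    def __init__(self):
        super().__init__()
        self.in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "review") in attrs:
            self.in_review = True

    def handle_data(self, data):
        if self.in_review and data.strip():
            self.reviews.append(data.strip())
            self.in_review = False

html = '<div class="review">Great!</div><div class="review">Meh.</div>'
p = ReviewExtractor()
p.feed(html)  # p.reviews now holds the extracted review text
```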

chrismarlow9 - 7 hours ago

Blackhat SEO spamming knew this 20 years ago

sarmike31 - 11 hours ago

Just wondering: RPA companies like UiPath are dead in the water, right?

overgard - 15 hours ago

I've been thinking of things I'd want an agent for recently. The problem is, everything I think of is something that requires using quite a few different websites, saving a lot of data securely, and working with a lot of sensitive accounts (my email, etc.)

The problem is, all the tasks are essentially: a) things agents probably just can't do, and b) things that absolutely cannot afford to be hallucinated or otherwise fucked up. So far the tasks I've thought of:

- Taxes. So it needs a lot of sensitive information to get W2's. Since I have to look up a lot of this stuff in the physical world anyway, it's not like I can just let it run wild.

- Background check for a new job. It took me 3 hrs to fill out one of them (mostly because the website was THAT bad). Being myself, I already was making mistakes just forgetting things like move in dates from 10 years ago, and having to do a lot of searching in my email for random documents. No way I'm trusting an agent with this.

- Setting up an LLC. Nope nope nope. There's a lot of annoying work involved with this, but I'm not trusting an LLM to do this.

Anyway, I guess my point is that even if an LLM was good at using my computer (so far, it seems like it wouldn't be), the kind of things I'd want an agent for are things that an LLM can't be trusted with.

dist-epoch - 15 hours ago

It doesn't matter.

Electron uses 10x more RAM than regular apps. But it's so convenient.

Python is 100x slower than C. It's in the top 3 of languages now.

Worse but more convenient always wins.

moralestapia - 16 hours ago

This is obvious. The problem is that not everything has an API, while everything has a human-oriented UI.

m3kw9 - 5 hours ago

I did a simple computer use to search something, and used up 50% of my 5h plan limit from codex.

morpheos137 - 5 hours ago

Who would have thunk? You know what is a great LLM agent API? Bash. Vast corpus, text-based, already trained into the model.

hamasho - 6 hours ago

I'm trying to use computer use and browser use (via Playwright MCP) in my work. Computer use is hit and miss (mostly miss), but Playwright MCP often works very well. The downside is it takes a lot of time to complete even easy tasks.

For example, to automate processing emails, it needs to:

1. Go to Gmail.

2. Log in to Google if necessary (this often requires two-step verification, so it's hard to automate completely, but possible).

3. Read the latest mail.

4. Check the content and choose the action: if needed, reply to the email; if it mentions tasks, add them to the todo list; if it mentions schedules, add them to the calendar.

5. Repeat for all emails based on specified conditions.

And each step requires dozens of DOM (a11y tree) analyses and actions (fill the username/password inputs, check "keep me logged in", click the submit button, etc.). Depending on the model used, each step can take ~100s, so easy tasks can easily add up to tens of minutes or even hours.

For frequently used tasks, I write skills like /logging-in and /read-latest-emails using Playwright scripts and let the agent choose them. Based on the email content, the agent chooses other tools like /write-reply, /add-todo, /add-event, etc., so that the model can focus only on the core tasks requiring thinking. It reduces the execution time drastically.

But it can bury important business logic in the Playwright scripts instead of the agent's instructions. For example, simplified steps to add TODO items look like:

1. Read the email.

2. Check if it's about todos, then decide to add them to Asana.

3. Extract and summarize the title, content, priority, due date, tags, etc.

4. Access Asana (log in if necessary).

5. Check if there are similar tasks.

6. If not, add the tasks.

This can take tens of minutes, and each step can carry important business logic, like how to decide the priority and due date, how to choose tags based on the content, and how to decide if two tasks are similar. This information should be readable and updatable not only by developers but also by managers and other teams. If I write those steps as skills with Playwright scripts, it improves the speed, but all that business logic is buried in the code, inaccessible to non-technical people. It's also error-prone, because websites often tweak their UI and the scripts stop working.

So it would be very convenient if the agent processed these steps once, then decided it's worth writing a Playwright script so that next time those mundane processes can be executed instantly.

With automatic skill generation, the agent decides by itself whether there are workflows worth turning into skills backed by Playwright scripts, like /log-in, /extract-information, /check-similar-tasks, /add-tasks. Like a Just-In-Time compiler, the skills are a byproduct of the agent's instructions; all business logic stays written in the instructions and doesn't need to be updated manually or tracked in a version control system.

This can cut a lot of execution time and API cost, and it applies beyond browser automation: to computer use, or to any agentic task where it's possible to write automation scripts for the steps that don't require thinking.
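The JIT analogy can be sketched like this (everything here is invented scaffolding; a real version would emit Playwright scripts as the cached skills):

```python
# First run goes through the slow agent loop and records its steps;
# later runs replay the generated "skill" directly, no LLM roundtrip.
skills = {}

def slow_agent_run(task: str) -> list:
    """Stand-in for the LLM-driven loop that figures out the steps."""
    return [f"{task}:step{i}" for i in range(3)]

def run(task: str):
    """Return (steps, was_cached), compiling a skill on first use."""
    if task in skills:
        return skills[task], True          # instant replay from the skill cache
    steps = slow_agent_run(task)
    skills[task] = steps                   # "compile" the workflow into a skill
    return steps, False

first, cached1 = run("add-todo")
second, cached2 = run("add-todo")  # same steps, served from the cache
```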

zephen - 15 hours ago

I find this extremely surprising.

When you think of everything it takes for an AI to use what the article calls a "vision agent" then it seems as if using a purpose-made API ought to be MANY orders of magnitude faster.

j45 - 8 hours ago

Sounds like some efficiency gains will still arrive.

doctorpcgum - 5 hours ago

Bh

RobRivera - 13 hours ago

UX feedback

Me: hmm, this title confuses and infuriates Rob.

[Clicks link]

Me: Sees same title, repeat feelings of confusion and infuriation

[Scrolls article down on my smartphone]

Me: Sees jpg with the same title, repeat feelings of confusion and infuriation.

[Closes tab]

[Continues living rest of my life]

I hope this feedback is well received and understood.

mrcwinn - 11 hours ago

We need a superset of HTML that is designed for agents. I'm not sure it's quite as simple as "just make everything an API."

ipunchghosts - 12 hours ago

I have a similar finding for a website I made that collates college-town bar specials and live music. Using agents with vision models works, but it's not as straightforward as one would initially think. You can check out the results here: https://www.nittanynights.com

creatonez - 12 hours ago

Browser agents / vision agents are a menace and ISPs should outright ban subscribers who run them on the public internet.

sanderjd - 14 hours ago

Only 45x?

taormina - 16 hours ago

The interface designed for humans is poor for AI needs? And the interface designed for programmatic use is easier for the AI to use? In other news, the sky is blue and water is wet.

gowld - 15 hours ago

Confusing title? "Computer Use" is actually "Browser vision"?

- 16 hours ago
[deleted]
deafpolygon - 12 hours ago

This is missing the point that AI training probably cost boatloads more to get here.

theabhinavdas - 12 hours ago

For now.

sneefle - an hour ago

[flagged]

BionicAI - 14 minutes ago

[flagged]

jacktu - 3 hours ago

[dead]

WhoffAgents - 12 hours ago

[flagged]

momo26 - 6 hours ago

[flagged]

Amber-chen - 6 hours ago

[dead]

lacymorrow - 12 hours ago

[dead]

rgilliotte - 13 hours ago

[dead]

overlord1109 - 6 hours ago

[dead]

volume_tech - 15 hours ago

[flagged]

doctorpcgum - 5 hours ago

[flagged]

faangguyindia - 16 hours ago

I saw Codex was screenshotting, then clicking around. I just stopped it and never used that again.

Using CLI tools is much faster and more token-efficient. I developed ten apps in the last two months. One reached 10,000+ monthly active users.

I ask Codex to generate SVGs line by line and backtrack-edit, ask it to use Inkscape to generate icons, etc...

I developed all this on a $20 Codex sub.

bottlepalm - 11 hours ago

There's no way this is true. I would argue in some cases computer use is less expensive. First, for APIs that don't even exist, it's a non-starter. Second, most APIs are not designed for agents and are verbose as hell: returning the entire DTO with tons of unnecessary properties burns tokens. Third, computer use is not as token-hungry as you think it is; a single screenshot may be just 1000 tokens. It's actually competitive and beats API workflows in many cases.
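The verbosity point is easy to put numbers on with a back-of-the-envelope comparison (the DTO fields below are invented, and tokens are approximated as characters/4, a rough rule of thumb rather than a real tokenizer):

```python
import json

# A full API DTO vs just the fields an agent actually needs.
full_dto = {
    "id": 42, "name": "Widget", "price": 9.99,
    "created_at": "2024-01-01T00:00:00Z",
    "updated_at": "2024-06-01T00:00:00Z",
    "audit": {"created_by": "system", "version": 17},
    "links": {"self": "/items/42", "related": "/items/42/related"},
}
trimmed = {k: full_dto[k] for k in ("id", "name", "price")}

def approx_tokens(obj) -> int:
    """Crude token estimate: serialized length divided by 4."""
    return len(json.dumps(obj)) // 4

ratio = approx_tokens(full_dto) / approx_tokens(trimmed)  # over-send factor
```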

0xWTF - 11 hours ago

So, to make this concrete, Akasa uses computer vision to read medical records to replace medical coders because there aren't enough medical coders to get all the billing right and medical systems leave like $1T a year on the table.

The EHRs could give companies like Akasa API access so Akasa could then just run NLP, but the EHR vendors don't grant various third parties API access for various reasons. So instead Akasa gets a seat license for each medical system they service, uses computer vision to read the screen (a cadre of Akasa medical coders reviews errors to keep up with unannounced changes from the EHR vendors), and then runs the NLP to figure out which CPT codes to assign so a bill can actually go to the payer and the hospitals can stay afloat.

So this 45x delta is how much more the medical systems pay Akasa because Epic won't work with Akasa.

This is but one example of why US medical bills are outrageously high.