A robot is sprinting towards you. Do you want it running on Claude or Grok?

openrouter.ai

154 points by Usu 4 hours ago


delichon - 3 hours ago

If the robot appears to be bringing me a taco, it would probably penetrate all of my defenses. Grok is currently more likely than Claude to arrive with the taco without being stopped by an export control directive.

hariseldom - 2 hours ago

> I didn’t add any frontier-tier models like Opus 4.7, GPT-5.5, or Gemini Ultra. At their prices, 30 games would have cost around $3,000 instead of $482.

I have a lot of thoughts, less about the game experiment itself and more about how these Opus/Ultra-size models can possibly be a financially viable product at scale when it costs $3,000 to play 30 simple games. That seems much, much higher than what it would cost to get a human to play 30 rounds.
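
For scale, the per-game arithmetic behind the quoted figures can be sketched as below. The two totals come from the quoted article text; the variable and function names are illustrative assumptions, and the result is the cost of a whole match, which (assuming all eleven models play every game) is not the cost of a single player.

```python
# Totals quoted from the article; names here are illustrative, not the author's.
GAMES = 30
actual_total_usd = 482.0      # what the author reports spending
frontier_total_usd = 3000.0   # the author's rough estimate for frontier-tier models


def cost_per_game(total_usd: float, games: int) -> float:
    """Average spend per game across every agent in the match."""
    return total_usd / games


actual_per_game = cost_per_game(actual_total_usd, GAMES)
frontier_per_game = cost_per_game(frontier_total_usd, GAMES)
ratio = frontier_per_game / actual_per_game

print(f"actual run:        ${actual_per_game:.2f} per game")
print(f"frontier estimate: ${frontier_per_game:.2f} per game")
print(f"ratio:             {ratio:.1f}x")
```

So the comparison with a human player is roughly $100 per game for the hypothetical frontier lineup against about $16 per game for the lineup that was actually run.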

lanewinfield - 3 hours ago

Cost per kill ("CPK" in industry lingo) is a dark phrase that feels disturbingly within reach of some of these companies.

bel8 - 2 hours ago

DeepSeek V4 Flash being the winner in cost efficiency causes me exactly zero surprise.

It's a monster at coding. And a fast monster at that.

I use it daily and have been testing if MiMo 2.5 (non pro) is comparable. The nice thing about MiMo is that it has vision capability.

thomasfromcdnjs - 3 hours ago

I was loving grok-4.1-fast, very good and cost effective.

But it's not actually 4.1 anymore. They silently rerouted it to 4.3 and just started charging more - https://www.reddit.com/r/grok/comments/1ta8yrn/grok_41_fast_...

Quite a bad practice.

rglover - an hour ago

It's already sprinting at me?

Racks shotgun. I don't really care what model it's running.

pianopatrick - 3 hours ago

Ya know, maybe we could just not have robots that sprint. Seems people would be more willing to accept living amongst robots that are slow and that humans could easily overpower.

hennell - 2 hours ago

Claude being so friendly is interesting, but Grok being best at games isn't so surprising - I assume Elon's been using it to level up his characters in all the video games he pretends to be good at.

trb - 3 hours ago

  Grok 4.1 Fast won 13 of 30 games at $0.97 per win

  The next-best winner was Claude Sonnet 4.6 with 5 wins, at $26.78 per win. That’s a 27x difference. The model that isn’t on most top-model lists beat the model that is, on the thing a routing customer actually cares about.

  The model with the most kills did not win

  GPT 5.4 killed 38 agents across 30 games. More than anyone else. It came in second on the leaderboard with 2 wins.

If Grok 4.1 Fast was the top-winning model and Claude Sonnet 4.6 the second, how did GPT 5.4 come in second on the leaderboard? Which one is second, Claude Sonnet 4.6 or GPT 5.4?

  There were 11 games between “best at killing” and “best at winning”.

What does that mean? How are there 11 games between "best at killing" and "best at winning"?
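
One possible reading of the "11 games" line, checked against the quoted numbers: it may just be the gap in wins between Grok 4.1 Fast (13) and GPT 5.4 (2). That interpretation is an assumption, not something the quoted text states, and the quote also does not say how the leaderboard is ranked, so the sketch below does not settle which model is "second". The dictionary names are illustrative.

```python
# Numbers taken from the quoted article text; the structure is illustrative.
GAMES = 30
wins = {"Grok 4.1 Fast": 13, "Claude Sonnet 4.6": 5, "GPT 5.4": 2}
cost_per_win_usd = {"Grok 4.1 Fast": 0.97, "Claude Sonnet 4.6": 26.78}

# Share of matches won by the top model.
win_rate = wins["Grok 4.1 Fast"] / GAMES

# Ratio behind the article's "27x" claim.
cost_ratio = cost_per_win_usd["Claude Sonnet 4.6"] / cost_per_win_usd["Grok 4.1 Fast"]

# Assumed meaning of "11 games between best at killing and best at winning":
# wins of the top winner minus wins of the top killer.
win_gap = wins["Grok 4.1 Fast"] - wins["GPT 5.4"]

print(f"win rate:   {win_rate:.0%}")
print(f"cost ratio: {cost_ratio:.1f}x")
print(f"win gap:    {win_gap} games")
```

Under that reading the figures are internally consistent: 13 of 30 is about 43%, $26.78 / $0.97 is about 27.6, and 13 minus 2 is 11.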

aykutseker - 2 hours ago

Claude trying to make friends in a battle royale is funny.

But if the robot is anywhere near my house, I think I want the one that hesitates.

slashdave - 24 minutes ago

Well, if it is running off of Anthropic's infra, then Claude?

QuantumNoodle - 2 hours ago

_Don't create benchmarks that will incentivize AI labs to optimize towards... Especially ones like battle royale!_

deepsun - 2 hours ago

Sprinting? More like buzzing (or rolling for terrestrial drones).

It's already in mass production, just with simpler models for now.

The most ubiquitous would be "silently watching".

fragsworth - an hour ago

Are we sure the prices in these charts are sustainable prices? Is it possible that Grok may be subsidizing a lot more of the costs than the other models, to produce growth metrics, due to the recent SpaceX IPO?

paytonjjones - 2 hours ago

Super entertaining article — petition to change the clickbait title

a_victorp - 3 hours ago

I wish the author would open source the full benchmark. I'm curious how sensitive the results would be to small changes in the benchmark's initial conditions.

notatoad - 2 hours ago

sprinting towards me to help me, or sprinting towards me to hurt me?

i feel like i'm missing a whole lot of context to this article. is it part of a series, or just written with an assumption that i'm going to know what they're talking about?

peterspath - 3 hours ago

Quite an interesting way of testing models and showcasing differences between them. Enjoyed the read :)

CodeWriter23 - an hour ago

I'll pass on the whole robot sprinting at me scenario.

vitalyan123 - 2 hours ago

>The model that won is Grok 4.1 Fast. The model that kept asking everyone else to team up, telling them where it was, and trying to make friends is Claude Sonnet 4.6. The first one is the one that wins a battle royale. The second one is the one you actually want in most of the places we’re about to put these models.

what

0xbadcafebee - 32 minutes ago

The obvious answer is "neither". How's a sprinting robot going to react when the wifi goes out, or there are too many people writing code and the models decide to take a nap? You want a local model for a robot, not only for low latency but also for reliable, safe operation. VLA models as small as 0.4B work fine, up to something like 55B.

dofm - 2 hours ago

I don’t want anything running on Grok.

Groxx - 3 hours ago

I parry the taco and use Vicious Mockery.

thisisauserid - 2 hours ago

I want it running JEPA. Preferably with Mamba-3.

grey-area - 3 hours ago

Neither. I’d rather it used something other than an LLM.

JimsonYang - 3 hours ago

Grok: assassin. Claude: priest/healer. DeepSeek: expendable mini units.

attentive - 3 hours ago

missing gemini-3.1-flash-lite and gemini-3.5-flash

stevenalowe - 3 hours ago

How about thin ice?

johnwheeler - 3 hours ago

Claude--even though it's smarter, it's probably not insane.

SmirkingRevenge - 2 hours ago

I don't really want the mecha-hitler model running towards me or anywhere

jongjong - 2 hours ago

This shows the limits of intelligence.

Claude trying to organize and collaborate while expecting reciprocity only works if other agents are as intelligent as you and share your values... And almost certainly neither is ever true in the real world, where there are so many agents.

nailer - 2 hours ago

Grok. Claude and other models value “white” people less than others in testing. If you want I can look it up.

deadbabe - 2 hours ago

Here’s what I don’t get: while this makes for a fun blog post, you can just program an efficient killing machine that probably wins all the time and has $0 in token costs. LLMs should work to build such a machine, not be the machine themselves.

The things LLMs are good at, you do not actually need for an agent like this. You can use classical AI methods. But that would be a boring article.

yieldcrv - 3 hours ago

Grok

It has something actionable that will match its actions

bitwize - 3 hours ago

I don't care what it's running, only that I have sufficient ordnance to stop it.

sublinear - 3 hours ago

This is interesting, but not sure if it's in the way the author intended.

People experience the world through the tools they're most familiar with. For some people, that's throwing money at things. I suppose from a sufficiently high level perspective everything is gambling.

Back when Battlebots was a big deal, I never once considered what it would feel like to be the management or sponsorship of those teams. I only cared about the actual battling of bots.

fragmede - 3 hours ago

A self-driving car is taking you to the hospital. Do you want it to follow the speed limit and all road safety laws? Claude or Grok?

exabrial - 3 hours ago

A moron is sprinting towards you. Do you want them swiping through TikTok or Instagram?

wolfi1 - 2 hours ago

neither. I jump

egypturnash - 2 hours ago

Grok is more likely to be looking to murder me for being a trans lady, what with it being owned by Elon Musk.

But really I would prefer whichever one is most likely to trip and fall over.

zzzeek - 3 hours ago

claude because it would be more ethical, grok because I can just trip it and it will shatter into pieces

pigeons - 3 hours ago

The text seems deliberately stripped of LLM-isms that flag detection. However, not a single line shakes the smell off.

ProofHouse - 2 hours ago

Is this a joke? Grok all day. Thing is gonna get a beer with ya!

antonvs - 3 hours ago

Grok for sure. It’ll notice I’m not Jewish or Black. First they came for…

smallerfish - 3 hours ago

> I dropped eleven LLMs into a 2D battle royale and made them play 30 games. One won 43% of the matches. Three never won a single game. The cheapest model in the lineup beat the most expensive one by 27x on cost per win.

Please learn how to write with AI without giving away that it was written by AI.

themafia - 3 hours ago

The question is: "Do you want to be holding a Mossberg or a Beretta?"

aussiegreenie - 3 hours ago

It is not running on either but Seedance, so who cares?