Advancing AI Benchmarking with Game Arena

blog.google

125 points by salkahfi 16 hours ago


ofirpress - 15 hours ago

This is a good way to benchmark models. We [the SWE-bench team] took the meta version of this idea and implemented it as a new benchmark called CodeClash:

We have agents implement agents that play games against each other. So Claude isn't playing against GPT directly; instead, an agent written by Claude plays poker against an agent written by GPT. This is a really tough task, and it leads to very interesting findings on AI for coding.

https://codeclash.ai/
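Roughly the shape of it, as a toy sketch (the two bots and the one-card game here are stand-ins, not the actual CodeClash harness; in the real benchmark each bot is a full program written by a model):

    import random
    from typing import Callable

    # A "bot" is any function mapping a game state to an action. These two
    # hand-written bots are stand-ins for programs the models would submit.

    def bot_a(my_card: int) -> str:   # imagine: generated by model A
        return "bet" if my_card >= 7 else "fold"

    def bot_b(my_card: int) -> str:   # imagine: generated by model B
        return "bet" if my_card >= 5 else "fold"

    def play_hand(p1: Callable[[int], str], p2: Callable[[int], str]) -> int:
        """Simplified one-card poker: a fold forfeits, otherwise higher card wins."""
        c1, c2 = random.sample(range(1, 11), 2)
        if p1(c1) == "fold":
            return -1
        if p2(c2) == "fold":
            return +1
        return 1 if c1 > c2 else -1

    # The harness scores the *code* the models wrote, not their direct play.
    score = sum(play_hand(bot_a, bot_b) for _ in range(10_000))
    print(f"bot_a net score over 10k hands: {score:+d}")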

jjcm - 5 hours ago

This was effectively what OpenAI did in the very early days with Dota 2: https://en.wikipedia.org/wiki/OpenAI_Five

As someone who's been playing Dota for nearly 20 years now, I found it fascinating to watch it play. Some of its decision-making didn't seem logical in the short term, but it would often be setting up future plays, even though its observation window was fairly small. Even more impressive was that the AI bot changed the meta for professional players, since tactics that arose out of its training turned out to be more optimal.

I wish we'd gotten to the point where other AI bots were out there, but it's entirely understandable: you couldn't drive a complex game like Dota with LLMs, whereas you can with the games Game Arena has selected.

kenforthewin - 11 hours ago

Let's add NetHack to the mix!

https://kenforthewin.github.io/blog/posts/nethack-agent/

iNic - 12 hours ago

I feel uneasy about werewolf being included here. I don't want AI labs to actively try and make their LLMs deceptive!

mohsen1 - 10 hours ago

Oh hey, I've been running Werewolf/Mafia games as benchmarks for a while now:

https://mafia-arena.com

Gemini is consistently winning against top models.
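The core loop is simple enough to sketch. Below is a toy version (hypothetical, not the actual mafia-arena implementation): in the real thing, each night kill and each day vote would come from a model prompted with the game transcript, rather than from random choice.

    import random

    def run_game(players: list[str]) -> str:
        """One Werewolf game with random stand-in decisions; returns the winner."""
        roles = {p: "werewolf" for p in random.sample(players, 2)}
        for p in players:
            roles.setdefault(p, "villager")
        alive = set(players)
        while True:
            wolves = [p for p in alive if roles[p] == "werewolf"]
            if not wolves:
                return "villagers"
            if 2 * len(wolves) >= len(alive):   # wolves reach parity and win
                return "werewolves"
            # Night: wolves eliminate a villager (a model would pick the target).
            alive.discard(random.choice([p for p in alive if roles[p] == "villager"]))
            # Day: everyone votes someone out (models would vote from the transcript).
            alive.discard(random.choice(list(alive)))

    # Score a model by its faction's win rate across many games.
    wins = sum(run_game([f"model_{i}" for i in range(7)]) == "werewolves"
               for _ in range(1_000))
    print(f"werewolf win rate with random play: {wins / 1_000:.1%}")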

ZeroCool2u - 14 hours ago

I'd really like to see them add a complex, open-world, fully physicalized game like Star Citizen (assuming the game itself is stable), with a single primary goal like accumulating currency. That would serve as a measure of general autonomy and a proxy for how the model might behave in the real world, given access to a bipedal robot.

deyiao - 8 hours ago

I believe that if a model can outperform humans in all board/card games, and can autonomously complete all video games, then AGI — or even ASI — has essentially been achieved. We’re still a long way from that.

10xDev - 15 hours ago

If AI can program, why does it matter whether it can play chess using CoT when it can program a chess engine instead? This applies to other domains as well.
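For what it's worth, the kind of throwaway engine a coding model can emit in one shot already plays far better than token-by-token CoT moves. Here's a sketch of such an engine, using the python-chess library and a shallow material-only negamax (illustrative, not any particular model's output):

    import chess  # pip install chess

    PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                    chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

    def material(board: chess.Board) -> int:
        """Material balance from White's perspective."""
        score = 0
        for piece_type, value in PIECE_VALUES.items():
            score += value * len(board.pieces(piece_type, chess.WHITE))
            score -= value * len(board.pieces(piece_type, chess.BLACK))
        return score

    def negamax(board: chess.Board, depth: int) -> int:
        """Score for the side to move after a shallow exhaustive search."""
        if depth == 0 or board.is_game_over():
            sign = 1 if board.turn == chess.WHITE else -1
            return sign * material(board)
        best = -10_000
        for move in board.legal_moves:
            board.push(move)
            best = max(best, -negamax(board, depth - 1))
            board.pop()
        return best

    def best_move(board: chess.Board, depth: int = 2) -> chess.Move:
        best, best_score = None, -10_001
        for move in board.legal_moves:
            board.push(move)
            score = -negamax(board, depth - 1)
            board.pop()
            if score > best_score:
                best, best_score = move, score
        return best

    board = chess.Board()
    print(board.san(best_move(board)))  # prints a reasonable opening move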

cv5005 - 15 hours ago

My personal threshold for AGI is when an AI can 'sit down' (it doesn't need robotic hands, but it must use only visual and audio inputs to make its moves) and complete a modern RPG or FPS single-player game that it hasn't pre-trained on (training on older games is fine).

simianwords - 15 hours ago

Gemini tops all the benchmarks, but when it comes to real-world usage it is genuinely unusable.

tiahura - 15 hours ago

How about NetHack?

eamag - 16 hours ago

Curious why they decided to curate poker hands instead of running normal poker games.

mclau153 - 12 hours ago

Claude plays Pokemon Red

bennyfreshness - 14 hours ago

Wow. I'm generally in the AI maximalist camp, but adding Werewolf feels dangerous to me. Anyone who's played knows that lying, deceit, and manipulation are often key to winning. Do we really want models climbing this benchmark?

chaostheory - 15 hours ago

Anecdotal data point, but recently I've found Gemini to perform better than ChatGPT when it comes to intent analysis.

PunchyHamster - 14 hours ago

Making models target a benchmark about being good at lying and getting away with it (Werewolf) is certainly an interesting choice.