MTG Bench: Testing how well LLMs can play Magic

mtgautodeck.com

31 points by CallumFerg 11 hours ago


derac - 28 minutes ago

I think running them against each other with a rules engine would be more interesting. Count up illegal moves and wins/unfinished games. I think LLM grading is too unreliable.

thurn - 43 minutes ago

To clarify, the more accurate description would be "Testing how well LLMs can follow the rules of Magic", right? There is no actual evaluation of how "well" they are playing?

josh_p - 2 hours ago

I know the author specifically did not use a rules engine in their simulation because they were unsure how it would affect the simulation.

I do still wonder if adapting something like Card Forge for LLM use would result in engaging gameplay.

https://github.com/Card-Forge/forge

OsrsNeedsf2P - 3 hours ago

I love obscure benchmarks, and I feel like I can trust their results a lot more; after all, they (probably) weren't benchmaxxed. RuneBench[0] is another good example (how well LLMs can play RuneScape).

[0] https://maxbittker.github.io/runebench/

purple-leafy - an hour ago

Benchmarks like this are onto something. Next frontier of LLM benchmarking.

jmccaf - 2 hours ago

Awesome! Does this use https://mage-bench.com/, or is it a separate project? I ran 4 local models in a tournament recently with mage-bench on an RTX 5090; Qwen 3.6 27B won narrowly over Gemma 4.

OwenCR - 2 hours ago

Sadly this benchmark removes the part of MTG that is most interesting: the opponent(s). Without opponents you simply don't have a game. You just have a rules engine - quite boring!

I think I object more to the decks used in testing than the machines' decisions. I do have nitpicks, though. This hand is quite poor and should have been mulliganed: https://app.mtgautodeck.com/public/benchmarks/4bd9955b-ebe1-.... The poor runout reinforces this decision.

This project is cool though, props for making it!

danbrooks - 3 hours ago

Very cool. I’ve been daydreaming about whether LLMs can be used to reason through gaming decisions.

TZubiri - 2 hours ago

Looking forward to this metric being Goodhart lawed.

Like how the strawberry example was overtrained for, or how the pelican on a bike started being used in official release posts.

pilord314 - an hour ago

They should run randomized games of Judge Tower and see who wins:

https://mtg.fandom.com/wiki/Judge_Tower