Mistral releases Devstral2 and Mistral Vibe CLI
mistral.ai | 740 points by pember 4 days ago
llm install llm-mistral
llm mistral refresh
llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle"
https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D...

Pretty good for a 123B model!
(That said I'm not 100% certain I guessed the correct model ID, I asked Mistral here: https://x.com/simonw/status/1998435424847675429)
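Since the benchmark's output is raw SVG markup, a quick sanity check is to confirm the model actually returned well-formed SVG before eyeballing the render. A minimal sketch (the `sample` string is a hypothetical stand-in for real model output, not Devstral's actual response):

```python
import xml.etree.ElementTree as ET

def is_valid_svg(text: str) -> bool:
    """Return True if `text` parses as XML with an <svg> root element."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # SVG roots carry the namespace {http://www.w3.org/2000/svg},
    # so compare only the local part of the tag name.
    return root.tag.split("}")[-1] == "svg"

# Hypothetical sample standing in for real model output
sample = '<svg xmlns="http://www.w3.org/2000/svg"><circle cx="5" cy="5" r="4"/></svg>'
print(is_valid_svg(sample))            # True
print(is_valid_svg("not svg at all"))  # False
```

This only checks syntactic validity, of course; whether the pelican looks like a pelican still needs a human (or a vision model) in the loop.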
We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark, but it would be a good idea to pair it with an additional random or unannounced similar test to catch any benchmaxxing.
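The randomized-variant idea above could be sketched roughly like this: draw an unannounced subject/vehicle pair at test time, so a lab can't benchmax the exact prompt. The word pools here are made up for illustration, not any actual test suite:

```python
import random

# Hypothetical subject/vehicle pools for unannounced variant tests
SUBJECTS = ["pelican", "walrus", "hedgehog", "flamingo", "otter"]
VEHICLES = ["bicycle", "unicycle", "skateboard", "canoe", "tandem bike"]

def random_prompt(rng: random.Random) -> str:
    """Draw an unannounced subject/vehicle pair for a side-by-side check."""
    subject = rng.choice(SUBJECTS)
    vehicle = rng.choice(VEHICLES)
    return f"Generate an SVG of a {subject} riding a {vehicle}"

rng = random.Random(42)  # seed so the same variant can be re-run across models
print(random_prompt(rng))
```

Running the known prompt and a fresh random variant side by side would make a sudden quality gap between the two easy to spot.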
I wrote about that possibility here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
Hi Simon! Love your work! Out of curiosity - how many pelican-cycling samples do you produce? Curious about the variance here. Thanks!
I've lost count, but there are 85 posts with that tag here: https://simonwillison.net/tags/pelican-riding-a-bicycle/
I need to extract them all into a formal collection.
I think the parent poster might be asking about generations per model-test. At least that's what I understood.
Aiden is perhaps misinformed. From a Bing search performed just now.
> Yes, I am familiar with the "pelican riding a bicycle" SVG generation test. It is a benchmark for evaluating the ability of AI models, particularly large language models (LLMs) and multi-modal systems, to generate original, high-quality SVG vector graphics based on a deliberately unusual and complex prompt. The benchmark was popularized by Simon Willison, who selected the prompt because:
Web search-based RAG is very different from having something embedded in a model's training data, though.
The ChatGPT website gives a similar answer. Are they running RAG, or the model?
> Yes — I’m familiar with the “pelican riding a bicycle” SVG generation test.
> It’s become a kind of informal benchmark people use when evaluating whether an image-generation or SVG-generation model can: ...
Runnin’ confabulations:
>Yes — the “hamster driving a car” prompt is a well-known informal test …
>…that’s a well-known informal test people use…(a mole-rat holding or playing a guitar).
Try any plausible concept. Get sillier, and it's trained to talk about it being nonsense; the output still claims it's a real test, just a real "nonsense" test.
[flagged]
Whatever you think Jimmc414's _concerns_ are (they merely state a possibility), Simon enumerates a number of concerns in the linked article and then addresses them. So I'm not sure why you think this is so.
Condescending and disrespectful to whom? Everybody wholesale? That doesn't seem reasonable. Please elaborate.
Not sure if I'd use the same descriptions so pointedly, but I can see what they mean.
It's perfectly fine to link for convenience, but it does feel a little disrespectful/SEO-y not to 'continue the conversation'. A summary at the very least, and how exactly it pertains. Sell us.
In a sense, link-dropping [alone] is saying: "go read this and establish my rhetorical/social position, I'm done here"
Imagine meeting an author/producer/whatever you liked. You'd want to talk about their work, how they created it, the impact it had, and so on. Now imagine if they did that... or if they waved their hand vaguely at a catalog.
I've genuinely been answering the question "what if the labs are training on your pelican benchmark" 3-4 times a week for several months at this point. I wrote that piece precisely so I didn't have to copy and paste the same arguments into dozens of different conversations.
Oh, no. Does this policing job pay well? /s Seriously: less is more, trust the process, any number of platitudes work here. Who are you defending against? Readers, right? You wrote your thing, defended it with more of the thing. It'll permeate. Or it won't. Does it matter?
You could be done, nothing is making you defend this (sorry) asinine benchmark across the internet. Not trying to (m|y)uck your yum, or whatever.
Remember, I did say linking for convenience is fine. We're belaboring the worst reading in comments. Inconsequential, unnecessary heartburn. Link the blog posts together and call it good enough.
Surprised to see snark re: what I thought was a standard practice (linking FAQs, essentially).
I hadn’t seen the post. It was relevant. I just read it. Lucky Ten Thousand can read it next time even though I won’t.
Simon has never seemed annoying, so unlike other comments that might worry me (even "Opus made this", which is cool, though I'm concerned someone astroturfed), that comment would never have raised my eyebrows. He's also dedicated, and I love that he devotes his time to a new field like this, where it's great to have attempts at benchmarks, folks cutting through chaff, etc.