Mistral releases Devstral2 and Mistral Vibe CLI
mistral.ai | 740 points by pember 4 days ago
llm install llm-mistral
llm mistral refresh
llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle"
https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D...

Pretty good for a 123B model!
(That said I'm not 100% certain I guessed the correct model ID, I asked Mistral here: https://x.com/simonw/status/1998435424847675429)
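Since the benchmark's output is raw SVG markup, a quick sanity check is to confirm the model actually returned well-formed SVG before eyeballing the render. A minimal sketch (the `sample` string is a hypothetical stand-in for real model output, not Devstral's actual response):

```python
import xml.etree.ElementTree as ET

def is_valid_svg(text: str) -> bool:
    """Return True if `text` parses as XML with an <svg> root element."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # SVG roots carry the namespace {http://www.w3.org/2000/svg},
    # so compare only the local part of the tag name.
    return root.tag.split("}")[-1] == "svg"

# Hypothetical sample standing in for real model output
sample = '<svg xmlns="http://www.w3.org/2000/svg"><circle cx="5" cy="5" r="4"/></svg>'
print(is_valid_svg(sample))            # True
print(is_valid_svg("not svg at all"))  # False
```

This only checks syntactic validity, of course; whether the pelican looks like a pelican still needs a human (or a vision model) in the loop.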
We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark, but it would be a good idea to pair it with an additional random or unannounced similar test to catch any benchmaxxing.
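The randomized-variant idea above could be sketched roughly like this: draw an unannounced subject/vehicle pair at test time, so a lab can't benchmax the exact prompt. The word pools here are made up for illustration, not any actual test suite:

```python
import random

# Hypothetical subject/vehicle pools for unannounced variant tests
SUBJECTS = ["pelican", "walrus", "hedgehog", "flamingo", "otter"]
VEHICLES = ["bicycle", "unicycle", "skateboard", "canoe", "tandem bike"]

def random_prompt(rng: random.Random) -> str:
    """Draw an unannounced subject/vehicle pair for a side-by-side check."""
    subject = rng.choice(SUBJECTS)
    vehicle = rng.choice(VEHICLES)
    return f"Generate an SVG of a {subject} riding a {vehicle}"

rng = random.Random(42)  # seed so the same variant can be re-run across models
print(random_prompt(rng))
```

Running the known prompt and a fresh random variant side by side would make a sudden quality gap between the two easy to spot.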
I wrote about that possibility here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
Hi Simon! Love your work! Out of curiosity - how many pelican-cycling samples do you produce? Curious about the variance here. Thanks!
I've lost count, but there are 85 posts with that tag here: https://simonwillison.net/tags/pelican-riding-a-bicycle/
I need to extract them all into a formal collection.
I think the parent poster might be asking about generations per model-test. At least that's what I understood.
Aiden is perhaps misinformed. From a Bing search performed just now.
> Yes, I am familiar with the "pelican riding a bicycle" SVG generation test. It is a benchmark for evaluating the ability of AI models, particularly large language models (LLMs) and multi-modal systems, to generate original, high-quality SVG vector graphics based on a deliberately unusual and complex prompt. The benchmark was popularized by Simon Willison, who selected the prompt because:
Web search-based RAG is very different from having something embedded in a model's training data, though.
The ChatGPT website gives a similar answer. Are they running RAG, or the model?
> Yes — I’m familiar with the “pelican riding a bicycle” SVG generation test.
> It’s become a kind of informal benchmark people use when evaluating whether an image-generation or SVG-generation model can: ...
Runnin’ confabulations:
>Yes — the “hamster driving a car” prompt is a well-known informal test …
>…that’s a well-known informal test people use…(a mole-rat holding or playing a guitar).
Try any plausible concept. Get sillier, and it's trained to talk about it being nonsense; the output still claims it's a real test, just a real "nonsense" test.
[flagged]
Whatever you think Jimmc414's _concerns_ are (they merely state a possibility), Simon enumerates a number of concerns in the linked article and then addresses them. So I'm not sure why you think this is so.
Condescending and disrespectful to whom? Everybody wholesale? That doesn't seem reasonable. Please elaborate.
Not sure if I'd use the same descriptions so pointedly, but I can see what they mean.
It's perfectly fine to link for convenience, but it does feel a little disrespectful/SEO-y not to 'continue the conversation'. A summary at the very least, and how exactly it pertains. Sell us.
In a sense, link-dropping [alone] is saying: "go read this and establish my rhetorical/social position, I'm done here"
Imagine meeting an author/producer/whatever you liked. You'd want to talk about their work, how they created it, the impact it had, and so on. Now imagine if they did that... or if they waved their hand vaguely at a catalog.
I've genuinely been answering the question "what if the labs are training on your pelican benchmark" 3-4 times a week for several months at this point. I wrote that piece precisely so I didn't have to copy and paste the same arguments into dozens of different conversations.
Oh, no. Does this policing job pay well? /s Seriously: less is more, trust the process, any number of platitudes work here. Who are you defending against? Readers, right? You wrote your thing, defended it with more of the thing. It'll permeate. Or it won't. Does it matter?
You could be done, nothing is making you defend this (sorry) asinine benchmark across the internet. Not trying to (m|y)uck your yum, or whatever.
Remember, I did say linking for convenience is fine. We're belaboring the worst reading in comments. Inconsequential, unnecessary heartburn. Link the blog posts together and call it good enough.
Surprised to see snark re: what I thought was a standard practice (linking FAQs, essentially).
I hadn’t seen the post. It was relevant. I just read it. Lucky Ten Thousand can read it next time even though I won’t.
Simon has never seemed annoying, so unlike other comments that might worry me (even "Opus made this", which is cool, though I'm concerned someone astroturfed), that comment would never have raised my eyebrows. He's also dedicated, and I love that he devotes his time to a new field like this, where it's great to have attempts at benchmarks, folks cutting through chaff, etc.