Magistral — the first reasoning model by Mistral AI

935 points by meetpateltech 7 days ago

I made some GGUFs for those interested in running them at https://huggingface.co/unsloth/Magistral-Small-2506-GGUF

ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL

./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99

Please use --jinja for llama.cpp and use temperature = 0.7, top-p 0.95!

Also best to increase Ollama's context length to say 8K at least: OLLAMA_CONTEXT_LENGTH=8192 ollama serve &. Some other details in https://docs.unsloth.ai/basics/magistral

ozgune - 7 days ago

Their benchmarks are interesting. They are comparing to DeepSeek-V3's (non-reasoning) December and DeepSeek-R1's January releases. I feel that comparing to DeepSeek-R1-0528 would be more fair.
For example, R1 scores 79.8 on AIME 2024, R1-0528 performs 91.4.
R1 scores 70 on AIME 2025, R1-0528 scores 87.5. R1-0528 does similarly better for GPQA Diamond, LiveCodeBench, and Aider (about 10-15 points higher).
https://huggingface.co/deepseek-ai/DeepSeek-R1-0528
- derefr - 5 days ago
  
  I presume that "outdated upon release" benchmarks like these happen because the benchmark and the models in it were chosen first, before the model was created; and the model's development progress was measured using the benchmark. It then doesn't occur to anyone that the benchmark the engineers had been relying upon isn't also a good/useful benchmark for marketing upon release. From the inside view, it's just a benchmark, already there, already achieving impressive results, a whole-company internal target to hit for months — so why not publish it?
- semi-extrinsic - 6 days ago
  
  Would also be interesting to compare with R1-0528-Qwen3-8B (chain-of-thought distilled from Deepseek-R1-0528 and post-trained into Qwen3-8B). It scores 86 and 76 on AIME 2024 and 2025 respectively.
  Currently running the 6-bit XL quant on a single old RTX 2080 Ti and I'm quite impressed TBH. Simply wild for a sub-8GB download.
  - saratogacx - 6 days ago
    
    I have the same card on my machine at home, what is your config to run the model?
    
    semi-extrinsic - 6 days ago
    
    Downloaded the gguf file by unsloth, ran llama-cli from llama.cpp with that file as an argument.
    IIUC, nowadays there is a jinja templated metadata-struct inside the gguf file itself. This contains the chat template and other config.
  - danielhanchen - 6 days ago
    
    I'm surprised it does very well as well - that's pretty cool to see!
danielhanchen - 7 days ago

Their paper https://mistral.ai/static/research/magistral.pdf is also cool! They edited GRPO via:
1. Removed KL Divergence
2. Normalize by total length (Dr. GRPO style)
3. Minibatch normalization for advantages
4. Relaxing trust region
- gyrovagueGeist - 7 days ago
  
  Does anyone know why they added minibatch advantage normalization (or when it can be useful)?
  The paper they cite "What matters in on-policy RL" claims it does not lead to much difference on their suite of test problems, and (mean-of-minibatch)-normalization doesn't seem theoretically motivated for convergence to the optimal policy?
  - danielhanchen - 6 days ago
    
    Tbh I'm unsure as well I took a skim of the paper so if I find anything I'll post it here!
- Onavo - 7 days ago
  
  > Removed KL Divergence
  Wait, how are they computing the loss?
  - danielhanchen - 7 days ago
    
    Oh it's the KL term sorry - beta * KL ie they set beta to 0.
    The goal of it was to "force" the model not to stray to far away from the original checkpoint, but it can hinder the model from learning new things
  - trc001 - 6 days ago
    
    It's become trendy to delete it. I say trendy because many papers delete it without offering any proof that it is meaningless
  - mjburgess - 7 days ago
    
    It's just a penalty term that they delete
monkmartinez - 7 days ago

At the risk of dating myself; Unsloth is the Bomb-dot-com!!! I use your models all the time and they just work. Thank you!!! What does llama.cpp normally use if not "jinja" for their templates?
- danielhanchen - 6 days ago
  
  Oh thanks! Yes I was gonna bring it up to them! Imo if there is a chat template, by default it should be --jinja
gavi - 6 days ago

too much thinking
https://gist.github.com/gavi/b9985f730f5deefe49b6a28e5569d46...
- fzzzy - 6 days ago
  
  My impression from running the first R1 release locally was that it also does too much thinking.
  - reissbaker - 6 days ago
    
    Magistral Small seems wayyy too heavy-handed with its RL to me:
    \boxed{Hey! How can I help you today?}
    They clearly rewarded the \boxed{...} formatting during their RL training, since it makes it easier to naively extract answers to math problems and thus verify them. But Magistral uses it for pretty much everything, even when it's inappropriate (in my own testing as well).
    It also forgets to <think> unless you use their special system prompt reminding it to.
    Honestly a little disappointing. It obviously benchmarks well, but it seems a little overcooked on non-benchmark usage.
  - cluckindan - 6 days ago
    
    It does not do any thinking. It is a statistical model, just like the rest of them.
    
    LordDragonfang - 6 days ago
    
    "Thinking" is a term of art referring to the hidden/internal output of "reasoning" models where they output "chain of thought" before giving an answer[1]. This technique and name stem from the early observation that LLMs do better when explicitly told to "think step by step"[2]. Hope that helps clarify things for you for future constructive discussion.
    [1] https://arxiv.org/html/2410.10630v1
    [2] https://arxiv.org/pdf/2205.11916
    
    bobsomers - 6 days ago
    
    We are aware of the term of art.
    The point that was trying to be made, which I agree with, is that anthropomorphizing a statistical model isn’t actually helpful. It only serves to confuse laypersons into assuming these models are capable of a lot more than they really are.
    That’s perfect if you’re a salesperson trying to dump your bad AI startup onto the public with an IPO, but unhelpful for pretty much any other reason, especially true understanding of what’s going on.
    
    LordDragonfang - 6 days ago
    
    If that was their point, it would have been more constructive to actually make it.
    To your point, it's only anthropomorphization if you make the anthrocentric assumption that "thinking" refers to something that only humans can do.[1]
    And I don't think it confuses laypeople, when literally telling it to "think" achieves the very similar results as in humans - it produces output that someone provided it out-of-context would easily identify as "thinking out loud", and improves the accuracy of results like how... thinking does.
    The best mental model of RLHF'd LLMs that I've seen is that they are statistical models "simulating"[1] how a human-like character would respond to a given natural-language input. To calculate the statistically "most likely" answer that an intelligent creature would give to a non-trivial question, with any sort of accuracy, you need emergent effects which look an awful like like a (low fidelity) simulation of intelligence. This includes simulating "thought". (And the distinction between "simulating thinking" and "thinking" is a distinction without a difference given enough accuracy)
    I'm curious as to what "capabilities" you think the layperson is misled about, because if anything they tend to exceed layperson understanding IME. And I'm curious what mental model you have of LLMs that provides more "true understanding" of how a statistical model can generate answers that appear nowhere in its training.
    [1] It also begs the question of whether there exists a clear and narrow definition of what "thinking" is that everyone can agree on. I suspect if you ask five philosophers you'll get six different answers, as the saying goes.
    [2] https://www.astralcodexten.com/p/janus-simulators
    
    zer00eyz - 6 days ago
    
    > It also begs the question of whether there exists a clear and narrow definition of what "thinking" is that everyone can agree on. I suspect if you ask five philosophers you'll get six different answers, as the saying goes.
    And yet we added a hand wavy 7th to humanize a peice of technology.
    
    andrepd - 6 days ago
    
    It's a misleading "term of art" which is more accurately described as a "term of marketing". Reasoning is precisely what LLMs don't do and it's precisely why they are unsuited to many tasks they are peddled for.
    
    LordDragonfang - 6 days ago
    
    How are you defining "reasoning" such that you are confident that LLMs are definitely not doing it? What evidence do you have to that effect? (And are you certain that none of your reasoning applies to humans as well?)
    
    cluckindan - 6 days ago
    
    They don’t ”think”.
    https://arxiv.org/abs/2503.09211
    They don’t ”reason”.
    https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...
    They don’t even always output their internal state accurately.
    https://arxiv.org/abs/2505.05410
    
    LordDragonfang - 5 days ago
    
    > https://arxiv.org/abs/2503.09211
    I am thoroughly unimpressed by this paper. It sets up a vague strawman definition of "thinking" that I'm not aware of anyone using (and makes no claim it applies to humans) and then knocks down the strawman.
    It also leans way too heavy on determinism - For one thing, we have no way of knowing if human brains are deterministic (until we solve whether reality itself is). For another, I doubt you would suddenly reverse your position if we created a LoRa composed of atmospheric noise, so it does not support your real position.
    > https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...
    This one is more substantial, but:
    "While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. [...] Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. [...] We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles."
    Starts by saying "we actually don't understand them" (meaning we don't know well enough to give a yes or no) and then proceeds to list flaws that, as I keep saying, also can be applied to most (if not all) humans' ability to reason. Human reasoning also collapses in accuracy above a certain complexities, and certainly are observed to fail to use explicit algorithms, as well as reasoning inconsistently across puzzles.
    So unless your definition of anthropomorphization excludes most humans, this is far from a slam dunk.
    > They don’t even always output their internal state accurately.
    I have some really bad news about humans for you. I believe (Buddha et al, 500 BCE) is the foundational text on this, but there's been some more recent research (Hume, 1739), (Kierkegaard, 1849)