Magistral — the first reasoning model by Mistral AI
mistral.ai935 points by meetpateltech 7 days ago
935 points by meetpateltech 7 days ago
I made some GGUFs for those interested in running them at https://huggingface.co/unsloth/Magistral-Small-2506-GGUF
ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL
or
./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99
Please use --jinja for llama.cpp and use temperature = 0.7, top-p 0.95!
Also best to increase Ollama's context length to say 8K at least: OLLAMA_CONTEXT_LENGTH=8192 ollama serve &. Some other details in https://docs.unsloth.ai/basics/magistral
Their benchmarks are interesting. They are comparing to DeepSeek-V3's (non-reasoning) December and DeepSeek-R1's January releases. I feel that comparing to DeepSeek-R1-0528 would be more fair.
For example, R1 scores 79.8 on AIME 2024, R1-0528 performs 91.4.
R1 scores 70 on AIME 2025, R1-0528 scores 87.5. R1-0528 does similarly better for GPQA Diamond, LiveCodeBench, and Aider (about 10-15 points higher).
I presume that "outdated upon release" benchmarks like these happen because the benchmark and the models in it were chosen first, before the model was created; and the model's development progress was measured using the benchmark. It then doesn't occur to anyone that the benchmark the engineers had been relying upon isn't also a good/useful benchmark for marketing upon release. From the inside view, it's just a benchmark, already there, already achieving impressive results, a whole-company internal target to hit for months — so why not publish it?
Would also be interesting to compare with R1-0528-Qwen3-8B (chain-of-thought distilled from Deepseek-R1-0528 and post-trained into Qwen3-8B). It scores 86 and 76 on AIME 2024 and 2025 respectively.
Currently running the 6-bit XL quant on a single old RTX 2080 Ti and I'm quite impressed TBH. Simply wild for a sub-8GB download.
I have the same card on my machine at home, what is your config to run the model?
Downloaded the gguf file by unsloth, ran llama-cli from llama.cpp with that file as an argument.
IIUC, nowadays there is a jinja templated metadata-struct inside the gguf file itself. This contains the chat template and other config.
Their paper https://mistral.ai/static/research/magistral.pdf is also cool! They edited GRPO via:
1. Removed KL Divergence
2. Normalize by total length (Dr. GRPO style)
3. Minibatch normalization for advantages
4. Relaxing trust region
Does anyone know why they added minibatch advantage normalization (or when it can be useful)?
The paper they cite "What matters in on-policy RL" claims it does not lead to much difference on their suite of test problems, and (mean-of-minibatch)-normalization doesn't seem theoretically motivated for convergence to the optimal policy?
Tbh I'm unsure as well I took a skim of the paper so if I find anything I'll post it here!
> Removed KL Divergence
Wait, how are they computing the loss?
Oh it's the KL term sorry - beta * KL ie they set beta to 0.
The goal of it was to "force" the model not to stray to far away from the original checkpoint, but it can hinder the model from learning new things
It's become trendy to delete it. I say trendy because many papers delete it without offering any proof that it is meaningless
At the risk of dating myself; Unsloth is the Bomb-dot-com!!! I use your models all the time and they just work. Thank you!!! What does llama.cpp normally use if not "jinja" for their templates?
Oh thanks! Yes I was gonna bring it up to them! Imo if there is a chat template, by default it should be --jinja
too much thinking
https://gist.github.com/gavi/b9985f730f5deefe49b6a28e5569d46...
My impression from running the first R1 release locally was that it also does too much thinking.
Magistral Small seems wayyy too heavy-handed with its RL to me:
\boxed{Hey! How can I help you today?}
They clearly rewarded the \boxed{...} formatting during their RL training, since it makes it easier to naively extract answers to math problems and thus verify them. But Magistral uses it for pretty much everything, even when it's inappropriate (in my own testing as well).
It also forgets to <think> unless you use their special system prompt reminding it to.
Honestly a little disappointing. It obviously benchmarks well, but it seems a little overcooked on non-benchmark usage.
It does not do any thinking. It is a statistical model, just like the rest of them.
"Thinking" is a term of art referring to the hidden/internal output of "reasoning" models where they output "chain of thought" before giving an answer[1]. This technique and name stem from the early observation that LLMs do better when explicitly told to "think step by step"[2]. Hope that helps clarify things for you for future constructive discussion.
We are aware of the term of art.
The point that was trying to be made, which I agree with, is that anthropomorphizing a statistical model isn’t actually helpful. It only serves to confuse laypersons into assuming these models are capable of a lot more than they really are.
That’s perfect if you’re a salesperson trying to dump your bad AI startup onto the public with an IPO, but unhelpful for pretty much any other reason, especially true understanding of what’s going on.
If that was their point, it would have been more constructive to actually make it.
To your point, it's only anthropomorphization if you make the anthrocentric assumption that "thinking" refers to something that only humans can do.[1]
And I don't think it confuses laypeople, when literally telling it to "think" achieves the very similar results as in humans - it produces output that someone provided it out-of-context would easily identify as "thinking out loud", and improves the accuracy of results like how... thinking does.
The best mental model of RLHF'd LLMs that I've seen is that they are statistical models "simulating"[1] how a human-like character would respond to a given natural-language input. To calculate the statistically "most likely" answer that an intelligent creature would give to a non-trivial question, with any sort of accuracy, you need emergent effects which look an awful like like a (low fidelity) simulation of intelligence. This includes simulating "thought". (And the distinction between "simulating thinking" and "thinking" is a distinction without a difference given enough accuracy)
I'm curious as to what "capabilities" you think the layperson is misled about, because if anything they tend to exceed layperson understanding IME. And I'm curious what mental model you have of LLMs that provides more "true understanding" of how a statistical model can generate answers that appear nowhere in its training.
[1] It also begs the question of whether there exists a clear and narrow definition of what "thinking" is that everyone can agree on. I suspect if you ask five philosophers you'll get six different answers, as the saying goes.
> It also begs the question of whether there exists a clear and narrow definition of what "thinking" is that everyone can agree on. I suspect if you ask five philosophers you'll get six different answers, as the saying goes.
And yet we added a hand wavy 7th to humanize a peice of technology.
It's a misleading "term of art" which is more accurately described as a "term of marketing". Reasoning is precisely what LLMs don't do and it's precisely why they are unsuited to many tasks they are peddled for.
How are you defining "reasoning" such that you are confident that LLMs are definitely not doing it? What evidence do you have to that effect? (And are you certain that none of your reasoning applies to humans as well?)
They don’t ”think”.
https://arxiv.org/abs/2503.09211
They don’t ”reason”.
https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...
They don’t even always output their internal state accurately.
> https://arxiv.org/abs/2503.09211
I am thoroughly unimpressed by this paper. It sets up a vague strawman definition of "thinking" that I'm not aware of anyone using (and makes no claim it applies to humans) and then knocks down the strawman.
It also leans way too heavy on determinism - For one thing, we have no way of knowing if human brains are deterministic (until we solve whether reality itself is). For another, I doubt you would suddenly reverse your position if we created a LoRa composed of atmospheric noise, so it does not support your real position.
> https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...
This one is more substantial, but:
"While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. [...] Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. [...] We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles."
Starts by saying "we actually don't understand them" (meaning we don't know well enough to give a yes or no) and then proceeds to list flaws that, as I keep saying, also can be applied to most (if not all) humans' ability to reason. Human reasoning also collapses in accuracy above a certain complexities, and certainly are observed to fail to use explicit algorithms, as well as reasoning inconsistently across puzzles.
So unless your definition of anthropomorphization excludes most humans, this is far from a slam dunk.
> They don’t even always output their internal state accurately.
I have some really bad news about humans for you. I believe (Buddha et al, 500 BCE) is the foundational text on this, but there's been some more recent research (Hume, 1739), (Kierkegaard, 1849)