DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning [pdf]
github.com
74 points by fspeech 3 hours ago
Is everyone just glossing over the first-place score of 118/120 on the Putnam?! I mean, we'll see how it does on the upcoming 2025 test, but that's insane!
We've seen absolutely ridiculous progress in model capability over the past year (which is also quite terrifying).
If I read it right, it uses multiple samples of itself to verify accuracy, but isn't this problematic?
It's cool, but I genuinely cannot fathom why they are targeting natural language proofs instead of a proof assistant.
Well they do that too: https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B
But I suppose the bigger goal remains improving their language model, and this was an experiment born from that. These works are symbiotic; the original DeepSeekMath resulted in GRPO, which eventually formed the backbone of their R1 model: https://arxiv.org/abs/2402.03300
More training data on advanced math. Lean is cool, but it's mostly about formalizing stuff we already know.
Natural language is a lot more, well, readable than, say, Lean. With Lean you get a lot less intuition and understanding of what the model is attempting to do in the first place.
Something is weird here: why is it so hard to have a deterministic program capable of checking a proof, or anything math-related? Isn't math super deterministic, while natural language is not? From first principles, it should be possible to do this without an LLM verifier.
I think that mathematical proofs, as they are actually written, rely on natural language and on a large amount of implicit shared knowledge. They are not formalized in the Principia Mathematica sense, and they are even further from the syntax required by modern theorem provers. Even the most rigorous proofs, such as those in Bourbaki, are not directly translatable into a fully formal system.
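To illustrate the gap: a checker like Lean *is* deterministic, but only for statements already translated into its syntax. A minimal sketch (Lean 4, core library only; the theorem name and even-ness encoding are my own choices) of what the informal sentence "the sum of two even numbers is even" must become before any kernel can check it:

```lean
-- Fully formal statement and proof. The Lean kernel verifies this
-- deterministically, but someone first had to translate the informal
-- sentence "the sum of two even numbers is even" into this exact syntax,
-- including choosing an encoding of "even" as ∃ k, n = 2 * k.
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k :=
  match ha, hb with
  | ⟨i, hi⟩, ⟨j, hj⟩ => ⟨i + j, by rw [hi, hj, Nat.mul_add]⟩
```

The translation step, not the checking step, is where the natural-language reasoner's output gets stuck.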
Verifying math requires something like Lean, which is a huge bottleneck, as the paper explains.
Plus there isn't a lot of training data in Lean.
Most gains come from training on stuff that's already out there, not really from the RLVR part, which just amps it up a bit.
Math can be super deterministic but often difficult to compute, because of concepts like proof by induction. I personally had to unlearn and rebase my computation-based understanding of math to 'get' pure math. Another example is set building: in pure math you often don't need to compute the members of a set, you just need to agree that some members exist that meet the criteria. How many there are, or how many things fall outside the set, is often not meaningful; you accept the fact and move on with the proof. From the computing perspective this can be difficult to put together.
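The set-building point above can be sketched in Lean 4 (core library only; the predicate `isBig` and the theorem name are illustrative, not from any real development):

```lean
-- A property carving out a "set" of naturals; we never enumerate it.
def isBig (n : Nat) : Prop := n > 1000000

-- Pure math only asks that *some* member exists. One witness suffices;
-- no search, enumeration, or counting of members is ever performed.
theorem exists_big : ∃ n, isBig n :=
  ⟨1000001, Nat.lt_succ_self 1000000⟩
```

Proving `∃ n, isBig n` commits us to nothing computational about the set beyond the single witness, which is exactly the "accept and move on" move described above.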
I haven’t read the paper yet, but I’d imagine the issue is converting the natural language generated by the reasoner into a form where a formal verifier can be applied.
Such a high-performance program could indeed be superior, if it existed (this area is very underdeveloped; there is no well-established, distributed solution that can handle a large domain) and if math were formalized in that program's DSL, which also hasn't happened yet.
Exciting stuff from a fantastic team.
That is amazing if they can do all of this at under 10% of the cost of the frontier labs. Of course, they work in the shadow of the great work done and shared by the frontier labs, but there is some exceptionally fast execution happening behind the scenes. This is clearly a race, but one where China is happy to be #2 as long as the gap is not significant and the costs are reasonable.
So it's designed for informal proofs, and it "verifies" based on a rubric-fitting function and human interaction, is that right?
What's the use case for a system like this?
Advanced math solving, as the results indicate. Informal proof reasoning is advancing faster than formal proof reasoning because the latter is slow and compute-intensive.
I suspect it's also because there isn't a lot of data to train on.