Phi-4 Bug Fixes

unsloth.ai

191 points by danielhanchen 4 days ago


danielhanchen - 4 days ago

Hey HN family! I found a few bugs in Phi-4, Microsoft's latest MIT-licensed LLM that's said to be on par with GPT-4o mini. The fixes (a quick code sketch follows the list):

1. End of sentence should be <|im_end|> not <|endoftext|>

2. Chat template should not auto add an assistant prompt

3. Padding token should not be EOS but <|dummy_87|>
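Roughly, the fixes map onto tokenizer settings like this (a minimal sketch with plain transformers, not the exact patch we ship):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("microsoft/phi-4")

    # Fix 1: end-of-sequence should be <|im_end|>, not <|endoftext|>
    tok.eos_token = "<|im_end|>"

    # Fix 3: pad with <|dummy_87|> instead of reusing the EOS token
    tok.pad_token = "<|dummy_87|>"

    # Fix 2: don't let the chat template auto-append an assistant header;
    # only ask for it explicitly at inference time.
    messages = [{"role": "user", "content": "Hello!"}]
    print(tok.apply_chat_template(messages, tokenize=False,
                                  add_generation_prompt=False))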

I also converted Phi-4 to Llama-arch. I uploaded GGUFs, 4-bit quants, dynamic quants and all the fixes to https://huggingface.co/unsloth
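If you want to try a 4-bit upload straight from transformers, something like this should work (the repo id below is an example; check the Hugging Face page above for the exact names):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "unsloth/phi-4-bnb-4bit"  # example id; see huggingface.co/unsloth for exact names
    tok = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

    messages = [{"role": "user", "content": "Explain the Phi-4 EOS fix in one sentence."}]
    inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                     return_tensors="pt").to(model.device)
    out = model.generate(inputs, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))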

I also made a Colab notebook to finetune Phi-4 on a free GPU: https://colab.research.google.com/github/unslothai/notebooks...
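The notebook boils down to roughly this (model id and hyperparameters here are illustrative; the Colab has the actual settings):

    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/phi-4",   # example id; see the Colab for the exact one
        max_seq_length=2048,
        load_in_4bit=True,            # fits on a free Colab GPU
    )

    # Attach LoRA adapters so only a small fraction of weights get trained
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=16,
    )
    # ...then train with trl's SFTTrainer as in the notebook.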

danielhanchen - 3 days ago

Update: The Phi-4 team is actively working on adding all our fixes into the original model! https://huggingface.co/microsoft/phi-4/discussions/21

RandyOrion - 3 days ago

Hi. It's nice to see these fixes.

I got a question after checking results on the open LLM leaderboard[1].

Comparing the results of NyxKrage/Microsoft_Phi-4 against microsoft/phi-4 or unsloth/phi-4, I can see that fixing both the tokenizer and the chat template causes the performance on IFEval and BBH to increase. However, the performance on MATH, GPQA and MUSR degrades A LOT.

Is there any explanation on why this is happening?

[1] https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

t1amat - 4 days ago

Daniel’s fixes to Phi-4 make it the best scoring Phi-4 on HF’s Open LLM Leaderboard. Great job on that.

Unsloth is a masterpiece, keep up the great work!

lostmsu - 4 days ago

The benchmark results of the model before and after the "fixes" do not match numbers reported in the model card: https://huggingface.co/microsoft/phi-4

According to Microsoft, the MATH score should be 80.4, while both the original and the "fixed" models as run by unsloth score just over 12.3. So either Microsoft made a few huge mistakes, or unsloth was not able to run their model correctly.

dorian-graph - 3 days ago

These seem like amazingly egregious mistakes for MS to have made? Or is it not as bad as it seems? I suppose I'm curious how these kinds of mistakes happen in a model release.

excerionsforte - 3 days ago

Available on Ollama already: https://ollama.com/vanilj/phi-4-unsloth
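If you already have Ollama installed, "ollama run vanilj/phi-4-unsloth" should pull and start it.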

NooneAtAll3 - 3 days ago

Application Error

TypeError: m(...).findLast is not a function

at L (https://unsloth.ai/assets/root-DexjOeLv.js:1:340)

at ia (https://unsloth.ai/assets/components-D38fXVcE.js:7:30549)

at Ac (https://unsloth.ai/assets/components-D38fXVcE.js:7:98661)

at Am (https://unsloth.ai/assets/components-D38fXVcE.js:7:94250)

at o0 (https://unsloth.ai/assets/components-D38fXVcE.js:7:93401)

at ha (https://unsloth.ai/assets/components-D38fXVcE.js:7:93212)

at Mm (https://unsloth.ai/assets/components-D38fXVcE.js:7:90555)

at Om (https://unsloth.ai/assets/components-D38fXVcE.js:7:89963)

at MessagePort.M (https://unsloth.ai/assets/components-D38fXVcE.js:1:11235)

adultSwim - 3 days ago

Are there alternatives to unsloth?

I would love to use it but the open/free version only handles one GPU, and it's unclear how much the paid version would cost. I have some limited access to multiple older NVidia cards and would love to make better use of them while I'm still learning. My budget for learning/projects is rather modest.

Hopefully they succeed. At work I could make a strong case for going with them as they allow keeping data local only, instead of relying on an API.

greensh - 3 days ago

Microsoft developed and trained Phi-4. How can there be bugs in their official implementation? Does this mean they trained and evaluated it on their own, completely different code and then ported it to the Hugging Face library for compatibility?

sinuhe69 - 3 days ago

How big is GPT-4o mini? Some sources say it's 8B, but I guess they have different models at different sizes. But if GPT-4o mini is really just 8B, I don't see the point of a "distilled" model that requires a much bigger network but is still not on par with the original. Because it's open source?

make3 - 3 days ago

"Yes it improves performance!" proceeds to show the most unconvincing stats ever

you can probably blow on your GPU and get a similar performance change

c1b - 3 days ago

Daniel, you're a legend, thanks for all you do!

One question: I see the perf comparisons here are done on an L4, but isn't this SKU pretty rare? I'm used to the T4 at that tier.

m3kw9 - 3 days ago

Fixing a model, though, is the first I’ve heard of.

TZubiri - 3 days ago

Ah yes, drawing ASCII art, the de facto benchmark for evaluating LLM quality.

wsintra2022 - 3 days ago

>Reddit comments show our fixes make Phi-4 inference much better

I’d like to try ‘Reddit comments show my fixes make app better’ in my next review