Phi-4 Bug Fixes

unsloth.ai

193 points by danielhanchen 6 months ago


danielhanchen - 6 months ago

Hey HN family! I found a few bugs in Phi-4, Microsoft's latest MIT-licensed LLM, which is said to be on par with GPT-4o mini:

1. The end-of-sequence (EOS) token should be <|im_end|> not <|endoftext|>

2. Chat template should not auto add an assistant prompt

3. Padding token should not be EOS but <|dummy_87|>
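
To make the three points above concrete, here is a toy sketch in plain Python. It is not Unsloth's or Microsoft's code. The token IDs, the `<|im_sep|>` role separator, and the helper names `render_chat` and `mask_padding` are assumptions made up for illustration. It shows a chat template that only appends an assistant prompt when asked, and one common reason a padding token equal to EOS causes trouble in finetuning: when padding positions are masked out of the loss, the real EOS gets masked too, so the model is never trained to stop.

```python
# Toy illustration only: IDs and helper names are hypothetical.
IM_START, IM_SEP, IM_END = "<|im_start|>", "<|im_sep|>", "<|im_end|>"
IGNORE_INDEX = -100  # label value conventionally skipped by cross-entropy loss

EOS_ID = 2  # stands in for <|im_end|>
PAD_ID = 3  # stands in for a dedicated pad token such as <|dummy_87|>


def render_chat(messages, add_generation_prompt=False):
    """Render messages; append an assistant prompt only on request."""
    text = "".join(
        f"{IM_START}{m['role']}{IM_SEP}{m['content']}{IM_END}" for m in messages
    )
    if add_generation_prompt:
        text += f"{IM_START}assistant{IM_SEP}"
    return text


def mask_padding(input_ids, pad_id):
    """Build training labels by hiding every position equal to pad_id."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


# Pad token shared with EOS: the real EOS at index 2 is masked as well.
shared = mask_padding([10, 11, EOS_ID, EOS_ID, EOS_ID], pad_id=EOS_ID)

# Dedicated pad token: EOS stays in the labels, only padding is masked.
dedicated = mask_padding([10, 11, EOS_ID, PAD_ID, PAD_ID], pad_id=PAD_ID)

training_text = render_chat([{"role": "user", "content": "Hi"}])
inference_text = render_chat(
    [{"role": "user", "content": "Hi"}], add_generation_prompt=True
)
```

With these definitions, `training_text` ends at `<|im_end|>`, while `inference_text` ends with an open assistant turn for the model to complete.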

I also converted Phi-4 to Llama-arch. I uploaded GGUFs, 4-bit quants, dynamic quants and all fixes to https://huggingface.co/unsloth

I also made a Colab notebook to finetune Phi-4 on a free GPU: https://colab.research.google.com/github/unslothai/notebooks...

danielhanchen - 6 months ago

Update: The Phi-4 team is actively working on adding all our fixes into the original model! https://huggingface.co/microsoft/phi-4/discussions/21

RandyOrion - 6 months ago

Hi. It's nice to see these fixes.

I got a question after checking results on the open LLM leaderboard[1].

Comparing the results of NyxKrage/Microsoft_Phi-4 with microsoft/phi-4 or unsloth/phi-4, I can see that fixing both the tokenizer and the chat template improves performance on both IFEval and BBH. However, the performance on MATH, GPQA and MUSR degrades A LOT.

Is there any explanation on why this is happening?

[1] https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

t1amat - 6 months ago

Daniel’s fixes to Phi-4 make it the best scoring Phi-4 on HF’s Open LLM Leaderboard. Great job on that.

Unsloth is a masterpiece, keep up the great work!

lostmsu - 6 months ago

The benchmark results of the model before and after the "fixes" do not match numbers reported in the model card: https://huggingface.co/microsoft/phi-4

According to Microsoft, the MATH score should be 80.4, while both the original and the "fixed" models as run by unsloth score only just over 12.3. So either Microsoft made a few huge mistakes, or unsloth was not able to run their model correctly.

dorian-graph - 6 months ago

These seem like amazingly egregious mistakes MS made? Or is it not as bad as it seems? I suppose, I'm curious how these kinds of mistakes happen for a model release.

excerionsforte - 6 months ago

Available on Ollama already: https://ollama.com/vanilj/phi-4-unsloth

NooneAtAll3 - 6 months ago

Application Error

TypeError: m(...).findLast is not a function

at L (https://unsloth.ai/assets/root-DexjOeLv.js:1:340)

at ia (https://unsloth.ai/assets/components-D38fXVcE.js:7:30549)

at Ac (https://unsloth.ai/assets/components-D38fXVcE.js:7:98661)

at Am (https://unsloth.ai/assets/components-D38fXVcE.js:7:94250)

at o0 (https://unsloth.ai/assets/components-D38fXVcE.js:7:93401)

at ha (https://unsloth.ai/assets/components-D38fXVcE.js:7:93212)

at Mm (https://unsloth.ai/assets/components-D38fXVcE.js:7:90555)

at Om (https://unsloth.ai/assets/components-D38fXVcE.js:7:89963)

at MessagePort.M (https://unsloth.ai/assets/components-D38fXVcE.js:1:11235

adultSwim - 6 months ago

Are there alternatives to unsloth?

I would love to use it, but the open/free version only handles one GPU, and it's unclear how much the paid version would cost. I have some limited access to multiple older NVIDIA cards and would love to make better use of them while I'm still learning. My budget for learning/projects is rather modest.

Hopefully they succeed. At work I could make a strong case for going with them as they allow keeping data local only, instead of relying on an API.

greensh - 6 months ago

Microsoft developed and trained Phi-4. How can there be bugs in their official implementation? Does this mean they trained and evaluated it on their own completely different code and then ported it to the Hugging Face library for compatibility?

sinuhe69 - 6 months ago

How big is GPT-4o mini? Some sources say it's 8B parameters, but I guess they have different models with different sizes. But if GPT-4o mini is just 8B, I don't see the point of a "distilled" model, which requires a much bigger network but is still not on par with the original. Because it's open source?

make3 - 6 months ago

"Yes it improves performance!" proceeds to show the most unconvincing stats ever

you can probably blow on your GPU and get a similar performance change

c1b - 6 months ago

Daniel, you're a legend, thanks for all you do!

One question: I see the perf comparisons here are done on an L4, but isn't this SKU very rare? I'm used to the T4 at that tier.

m3kw9 - 6 months ago

But fixing a model is a first for me.

TZubiri - 6 months ago

Ah yes, drawing ASCII art, the de facto benchmark for evaluating LLM quality.

wsintra2022 - 6 months ago

>Reddit comments show our fixes make Phi-4 inference much better

I’d like to try ‘Reddit comments show my fixes make app better’ in my next review