Fine-tuning LLMs is a waste of time

codinginterviewsmadesimple.substack.com

192 points by j-wang 7 days ago


rybosome - 7 days ago

I think the point the author misses is that many applications of fine-tuning are to get a model to do a single task. This is what I have done in my current role at my company.

We’ve fine-tuned open-weight models for knowledge injection, among other things, and gotten a model that’s better than OpenAI models at exactly one hyper-specific task for our use case, which is hardware verification. Or we’ve fine-tuned the OAI models themselves and gotten significantly better OAI models at this task, and then only used them for this task.

The point is that a network of hyper-specific fine-tuned models is how a lot of stuff is implemented. So I disagree from direct experience with the premise that fine-tuning is a waste of time because it is destructive.

I don’t care if I “damage” Llama so that it can’t write poetry, give me advice on cooking, or translate to German. In this instance I’m only ever going to prompt it with: “Does this design implement the AXA protocol? <list of ports and parameters>”

kouteiheika - 6 days ago

> Adapter Modules and LoRA (Low-Rank Adaptation) insert new knowledge through specialized, isolated subnetworks, leaving existing neurons untouched. This is best for stuff like formatting, specific chains, etc- all of which don’t require a complete neural network update.

This highlights to me that the author doesn't know what they're talking about. LoRA does exactly the same thing as normal fine-tuning, it's just a trick to make it faster and/or be able to do it on lower end hardware. LoRA doesn't add "isolated subnetworks" - LoRA parameters are added to the original weights!

Here's the equation for the forward pass from the original paper[1]:

    h = W_{0} * x + ∆W * x = W_{0} * x + B * A * x
where "W_{0}" are the original weights and "B" and "A" (which give us "∆W_{x}" after they're multiplied) are the LoRA adapter. And if you've been paying attention it should also be obvious that, mathematically, you can merge your LoRA adapter into the original weights (by doing "W = W_{0} + ∆W") which most people do, or you could even create a LoRA adapter from a fully fine-tuned model by calculating "W - W_{0}" to get ∆W and then do SVD to recover B and A.

If you know what you're doing, anything you can do with LoRA you can also do with full fine-tuning, but better. It might be true that it's somewhat harder to "damage" a model by doing LoRA (because the parameter updates are fundamentally low rank due to the LoRA adapters being low rank), but that's a skill issue and not a fundamental property.

[1] -- https://arxiv.org/pdf/2106.09685

kamranjon - 7 days ago

This is a pretty awful take. Everyone understands they are modifying the weights - that is the point. It’s not like these models were released with all of the weights perfectly accounted for and changing them in any way ruins them. The awesome thing about fine-tuning is that the weights are malleable and you have a great base to start from.

Also, the basic premise that knowledge injection is a bad use-case seems flawed? There are countless open models released by Google that completely fly in the face of this. MedGemma is just Gemma 3 4B fine-tuned on a ton of medical datasets, and it’s measurably better than stock Gemma within the medical domain. Maybe it lost some ability to answer trivia about Minecraft in the process, but isn’t that kinda implied by “fine-tuning” something? You’re making it purpose-built for a specific domain.

reissbaker - 7 days ago

Clickbait headline. "Fine-tuning LLMs for knowledge injection is a waste of time" is true, but IDK who's trying to do that. Fine-tuning is great for changing model behavior (i.e. the zillions of uncensored models on Hugging Face are much more willing to respond to... dodgy... prompts than any amount of RAG is gonna get you), and RAG is great for knowledge injection.

Also... "LoRA" as a replacement for finetuning??? LoRA is a kind of finetuning! In the research community it's actually referred to as "parameter efficient finetuning." You're changing a smaller number of weights, but you're still changing them.

muzani - 7 days ago

Fine-tuning was the best option at one point. It's still a great option if you want an override (e.g. categorization or dialects), but it's not precise.

Changes that happened:

1. LLMs got a lot cheaper but fine tuning didn't. Fine tuning was a way to cut down on prompts and make them zero-shot (not requiring examples).

2. Context windows became bigger. Fine tuning was great when the model was expected to respond with a sentence.

3. The two things above made RAG viable.

4. Training got better on released models, to the point where zero-shot worked fine. Fine tuning ends up overriding the things that were scoring nearly full points on benchmarks.

simonw - 7 days ago

"Fine-tuning large language models (LLMs) is frequently sold as a quick, powerful method for injecting new knowledge"

Is that true though? I don't think I've seen a vendor selling that as a benefit of fine-tuning.

robrenaud - 7 days ago

There is no real difference between fine-tuning with and without a LoRA. If you give me a model with a LoRA adapter, I can give you an updated model, without the extra LoRA params, that is functionally identical.

Fitting a LoRA changes potentially useful information the same way that fine-tuning the whole model does. It's just that the LoRA restricts the expressiveness of the weight update so that it is compactly encoded.

ankit219 - 7 days ago

I saw this and immediately relived the last two years of the journey. I think some of the mental models that helped me might help the community too.

What people expect from finetuning is knowledge addition. You want to keep the styling[1] of the original model and just add new knowledge points that would help your task. In-context learning is one example of how this works well. Even here, though, if the context is out of distribution, a model does not "understand" it and will produce guesswork.

When it comes to LoRA or PEFT or adapters, it's about style transfer. If you focus on a specific style of content, you will see the gains, just that the model won't learn new knowledge that wasn't already in the original training data. It will forget previously learnt styles depending on context. When you do full finetuning (or SFT with no frozen parameters), it will alter all the parameters, resulting in a gain of new knowledge at the cost of previous knowledge (and it would give you some gibberish if you ask about topics outside of the domain). This is called catastrophic forgetting. Hence, yes, full finetuning works - it's just an imperfect solution like all the others. Recently, with reinforcement learning, there has been talk of continual learning, which is where Richard Sutton's latest paper also lands, but that's at the research level.

Having said all that, if you start with the wrong mental model for finetuning, you will be disappointed with the results.

The problem to solve is adding new knowledge while preserving the original pretrained intelligence. Still a work in progress, but we published a paper last year on one way it could be done. Here is the link: https://arxiv.org/abs/2409.17171 (it also has experimental results for all the different approaches).

[1]: Styling here means the style learned by the model in SFT. E.g. bullets, lists, bolding different headings, etc. All of that makes the content readable - the understanding of how to present the answer to a specific question.

solresol - 7 days ago

I think of it as trying to encourage the LLM to want to give answers from a particular part of the phase space. You can do it by fine tuning it to be more likely to return values from there, or you can prompt it to get into that part of the phase space. Either works, but fiddling around with prompts doesn't require all that much MLops or compute power.

That said, fine tuning small models because you have to power through vast amounts of data where a larger model might be cost ineffective -- that's completely sensible, and not really mentioned in the article.

Mathnerd314 - 7 days ago

Wasn't there that thing about how large LLMs are essentially compression algorithms (https://arxiv.org/pdf/2309.10668)? Maybe that's where this article is coming from: the idea that finetuning "adds" data to the set of data that compresses well. But that indeed doesn't work unless you mix the finetuning data in with the original training corpus of the base model. I think the article is wrong, though, in saying it "replaces" the data - it's true that finetuning without keeping in the original training corpus increases loss on the original data, but "large" in LLM really is large, and current models are not trained to saturation, so there is plenty of room to fit in finetuning if you do it right.

elzbardico - 6 days ago

Lots of prophets in every gold rush...

While the author makes some good points (along with some non-factual assertions), I wonder why he went with this counterproductive and factually wrong clickbait title.

Fine-tuning (and LoRA IS fine-tuning) may not be cost-effective for most organizations for knowledge updates, but it excels at driving behavior in task-specific ways: alignment, enforcing structured output (usually far more accurately than prompting), and tool and function use. And depending on the type of knowledge - if it is highly specific, niche, long-tail knowledge - it can even make smaller models beat bigger models, as in the case of MedGemma.

rco8786 - 6 days ago

> Instead, use modular methods like retrieval-augmented generation, adapters, or prompt-engineering — these techniques inject new information without damaging the underlying model’s carefully built ecosystem.

So obviously this is what most of us are already doing, I would venture. But there's a pretty big "missing middle" here. RAG/better prompts serve to provide LLMs with the context they need for a specific task, but are heavily limited by context windows. I know they've been growing quite a bit, but from my usage it still seems that things further back in the window get forgotten about pretty regularly.

Fine tuning was always the pitch for the solution to that. By baking the "context" you need directly into the LLM. Very few people or companies are actually doing this though, because it's expensive and you end up with an outdated model by the time you're done...if you even have the data you need to do it in the first place.

So where we're left is basically without options for systems that need more proprietary knowledge than we can reasonably fit into the context window.

I wonder if there's anyone out there attempting to do some sort of "context compression". An intermediary step that takes our natural language RAG/prompts/context and compresses it into a data format that the LLM can understand (vectors of some sort?) but are a fraction of the tokens that the natural language version would take.
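
Roughly the shape of what I'm imagining, sketched out in PyTorch (totally hypothetical - the class and every name in it are made up, just to show the idea of squeezing a long context down to a few vectors):

    import torch
    import torch.nn as nn

    # Hypothetical "context compressor": squeeze a long run of token embeddings
    # down to a handful of vectors the model would read in place of the raw text.
    class ContextCompressor(nn.Module):
        def __init__(self, hidden_size, n_compressed=16, n_heads=8):
            super().__init__()
            # hidden_size must be divisible by n_heads
            self.queries = nn.Parameter(torch.randn(n_compressed, hidden_size) * 0.02)
            self.attn = nn.MultiheadAttention(hidden_size, n_heads, batch_first=True)

        def forward(self, context_embeddings):
            # context_embeddings: (batch, long_seq, hidden)
            batch = context_embeddings.size(0)
            q = self.queries.unsqueeze(0).expand(batch, -1, -1)
            compressed, _ = self.attn(q, context_embeddings, context_embeddings)
            return compressed  # (batch, n_compressed, hidden)

No idea how well something like that would actually work, but that's the gist of the intermediary step I mean.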

edit: After I wrote this, I fed it into ChatGPT and asked if there were techniques I was missing. It introduced me to LoRA (which I suppose is the "adapters" mentioned in the OP), and now I have a whole new rabbit hole to climb down. AI is pretty cool sometimes.

ilaksh - 7 days ago

Obviously there are going to be narrow tasks where fine tuning makes sense. But using leading models for agents is a completely different mindset and approach.

I say this because I have been working on replacing multiple humans handling complex business processes mostly end-to-end (with a human in the loop somehow in there).

I find that I need the very best models to be able to handle a lot of instructions and make the best decisions about tool selection. And overall I just need the most intelligence possible to make fewer weird errors or misinterpretations of the instructions or situations/data.

I can see how fine tuning would help for some issues like some report formatting. But that output comes at the end of the whole process. And I can address formatting issues almost instantly by either just using a smarter model that follows instructions better, or adding a reminder instruction, or creating a simpler subtask. Sometimes the subtask can run on a cheaper model.

So it's kind of like the difference between building a traditional manufacturing line with very specific robot arms, tooling, and conveyor belts, versus plugging in just a few different humanoid robots with assembly manuals and access to more general-purpose tools on their belt. You used to always have to build the full traditional line. In many cases that doesn't necessarily make sense anymore.

gdiamos - 7 days ago

It's pretty frustrating to spend weeks on finetuning and end up with a model that says:

"SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT ..."

mountainriver - 5 days ago

I love how people say things like this with complete disregard for research.

Most LLM research involves fine tuning models, and we do amazing things with it. R1 is a fine tune, but I guess that’s bad?

Our company adds knowledge with fine tuning all the time. It’s usually a matter of skill, not some fundamental limit. You need to either use LoRA or use a large batch size and mix the previous training data in.
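
A toy sketch of the mixing I mean (the replay fraction is just an arbitrary knob, nothing principled about 50%):

    import random

    def mixed_batches(finetune_data, replay_data, batch_size=32, replay_frac=0.5):
        # Each batch mixes new task examples with examples replayed from the
        # original training distribution, to limit forgetting.
        n_replay = int(batch_size * replay_frac)
        while True:
            batch = random.sample(finetune_data, batch_size - n_replay)
            batch += random.sample(replay_data, n_replay)
            random.shuffle(batch)
            yield batch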

All we are doing is forcing deep representations. This isn’t a binary “fine tuning good/bad”; it’s a spectrum of how deep and robust you make the representations.

adultSwim - 9 hours ago

For medical applications, across several generations of models, we see fine-tuned models outperform base models of similar size. However, newer/bigger general base models outperform smaller fine-tuned models.

Also, as others have pointed out, supervised fine-tuning can be quite useful for teaching how to perform specific tasks. I agree with the author that RAG generally is more suited for injecting additional knowledge.

Nevermark - 7 days ago

It would be very interesting to fine tune a model for a narrow task, while tracking its performance on every original training sample from the pre-tuning baseline.

I expect it would greatly help characterize what was lost, at the expense of a great deal of extra computation. But with enough experiments it might shed some more general light.

I suspect the smaller the tuning dataset, the faster and worse the overwriting will be, since the new optimization surface will be so much simpler to navigate than the much bigger dataset's optimization surface.

Then a question might be, what percentage of the original training data, randomly retained, might slow general degradation.
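
Something like this per-example probe is what I have in mind - run it every N fine-tuning steps on a fixed slice of the original data and diff against the pre-tuning baseline (sketch assumes a Hugging Face-style causal LM; names are illustrative):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def probe_losses(model, probe_batches):
        # Per-example language-modeling loss on a fixed probe set drawn from the
        # original training data; compare these numbers against the base model's.
        model.eval()
        losses = []
        for input_ids in probe_batches:              # each: (batch, seq) token ids
            logits = model(input_ids=input_ids).logits
            shift_logits = logits[:, :-1, :]          # predict token t+1 from token t
            shift_labels = input_ids[:, 1:]
            per_token = F.cross_entropy(
                shift_logits.reshape(-1, shift_logits.size(-1)),
                shift_labels.reshape(-1),
                reduction="none",
            ).view(shift_labels.shape)
            losses.extend(per_token.mean(dim=1).tolist())
        model.train()
        return losses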

mehulashah - 7 days ago

Fine tuning isn’t for everything, but it certainly makes it easy to build models for special purposes, e.g. metadata extraction. Happy to lose some capability in another domain for that, e.g. Pokémon. The headline is a bit too general.

arbfay - 6 days ago

Before the post-ChatGPT boom, we used to talk about "catastrophic forgetting"...

Make sure the new training dataset is "large" by augmenting it with general data (think of it as a sample of the original dataset), use PEFT techniques (freezing weights => less risk), and use regularization (elastic weight consolidation).
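
For the elastic weight consolidation part, the penalty is just a quadratic pull back toward the pre-fine-tuning weights, weighted by an estimate of each parameter's (diagonal) Fisher information. A minimal sketch, assuming ref_params and fisher are dicts of detached tensors you computed beforehand and lam is a knob you tune:

    import torch

    def ewc_penalty(model, ref_params, fisher, lam=0.4):
        # Penalize moving parameters the original model relied on heavily.
        penalty = torch.zeros((), device=next(model.parameters()).device)
        for name, p in model.named_parameters():
            if name in fisher:
                penalty = penalty + (fisher[name] * (p - ref_params[name]) ** 2).sum()
        return 0.5 * lam * penalty

    # In the fine-tuning loop:
    #   loss = task_loss + ewc_penalty(model, ref_params, fisher)
    #   loss.backward(); optimizer.step()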

Fine-tuning is fine, but it will be more expensive than you thought and should be led by more experienced ML engineers. You probably don't need to fine-tune models anyway.

Kiyo-Lynn - 7 days ago

I feel that the effects of fine-tuning are often short-term, and sometimes it can end up overwriting what the model has already learned, making it less intelligent in the process. I lean more towards using adaptive methods, optimizing prompts, and leveraging more efficient ways to handle tasks. This feels more practical and resource-efficient than blindly fine-tuning. We should focus on finding ways to maximize the potential of existing models without damaging their current capabilities, rather than just relying on fine-tuning.

ZacWil - 6 days ago

Correct me if I am wrong, but I thought the point of fine-tuning was to get precise returns. We make it hyper specific to the task at hand. Sure, we can get 90% of the way there without fine-tuning, but most of these models are vast. I would argue that it potentially MAY be a waste of time right out the gate.

mapinxue - 7 days ago

RAG and fine-tuning are suitable for different business scenarios. For directional, persistent knowledge - such as adjustments for the power, energy, and other fields - fine-tuning can bring better performance;

RAG is more suited to temporary, variable situations.

In addition, LoRA is also a fine-tuning technique, and this is stated in the original paper.

a_c - 6 days ago

I don’t know if fine tuning works. But if it doesn’t, then are we assuming the underlying weights are optimal? At what point do we determine that a network is properly “trained” and any subsequent training is “fine tuning”?

varsketiz - 7 days ago

I am under the impression that fine tuning is expensive (could anyone put a number on that?) and that each time a new model is released you have to fine tune it again, paying full price every time.

clauderoux - 7 days ago

Seriously, most fine-tuning now is done with LoRA adapters. They are much faster and more reliable. In my lab, I don't know anybody who is trying to do any kind of thorough fine-tuning...

Havoc - 6 days ago

Overwrite seems a bit strong. Closer to adjusting. Which is the whole point of fine tuning.

titaniumrain - 5 days ago

This post is hilarious. People like this author are the ones vetting start-ups? Please. The idea that alignment leads to a degradation in model utility is hardly news.

But let’s be clear: fine-tuning an LLM to specialize in a task isn’t just about minimizing utility loss. It’s about trade-offs. You have to weigh what you gain against what you lose.

iamnotagenius - 7 days ago

Fine-tuning is an excellent way to reliably bake domain-specific data into a model; there are plenty of coding finetunes on Hugging Face that outperform foundation models on, say, coding, without significant loss in other domains.

j-wang - 7 days ago

"But this logic breaks down for advanced models, and badly so. At high performance, fine-tuning isn’t merely adding new data — it’s overwriting existing knowledge. Every neuron updated risks losing information that’s already intricately woven into the network. In short: neurons are valuable, finite resources. Updating them isn’t a costless act; it’s a dangerous trade-off that threatens the delicate ecosystem of an advanced model."

Mainly including this article to spark discussion—I agree with some of this and not with all of it. But it is an interesting take.