Open Weights isn't Open Training

103 points by addiefoote8 a day ago

The framing here is undersold in the broader discourse: "open weights" is a ruse for reproducibility. What you have is closer to a compiled binary than source code. You can run it, you can diff it against other binaries, but you cannot, in any meaningful sense, reproduce or extend it from first principles.

This matters because OSS truly depends on the reproducibility claim. "Open weights" borrows the legitimacy of open source (the assumption that scrutiny is possible, that no single actor has a moat, that iteration is democratised). Truly democratised iteration would crack open the training stack and let you generate intelligence from scratch.

Huge kudos to Addie and the team for this :)

Wowfunhappy - 15 hours ago

But how useful is source code if it takes millions of dollars to compile? At that point, if you do need to make changes, it probably makes more sense to edit the precompiled binary. Even the original developers are doing binary edits in most cases.
I agree that open weight models should not be considered open source, but I also think the entire definition breaks down under the economics of LLMs.
- scottlamb - 15 hours ago
  
  There are lots of reasons to read through source code you never edit or recompile: security audits, interoperability, learning from their techniques, etc. And I think many of those same ideas apply to seeing the training data of a LLM. It will help you understand quickly (without as much experimentation) what it's likely to be good at, where its biases may be, where some kind of supplement (transfer learning? RAG? whatever) might be needed. And the why.
  - vova_hn2 - 9 hours ago
    
    > security audits
    If you are unable to run the multimillion training, then any kind of security audit of the training code is absolutely meaningless, because you have no way to verify that the weights were actually produced by this code.
    Also, the analogy with source code/binary code fails really fast, considering that model training process is non-deterministic, so even if are able to run the training, then you get different weights than those that were released by the model developers, then... then what?
    
    HappMacDonald - 2 hours ago
    
    > considering that model training process is non-deterministic
    Why would it have to be? Just use PRNG with published seeds and then anyone can reproduce it.
    
    dataflow - an hour ago
    
    I have zero actual experience in training models, but in general, when parallelizing work: there can be fundamental nondeterminism (e.g., some race conditions) that is tolerated, whose recording/reproduction can be prohibitive performance-wise.
  - oscarmoxon - 15 hours ago
    
    Agree, this feels like a distinction that needs formalising...
    Passive transparency: training data, technical report that tells you what the model learned and why it behaves the way it does. Useful for auditing, AI safety, interoperability.
    Active transparency: being able to actually reproduce and augment the model. For that you need the training stack, curriculum, loss weighting decisions, hyperparameter search logs, synthetic data pipeline, RLHF/RLAIF methodology, reward model architecture, what behaviours were targeted and how success was measured, unpublished evals, known failure modes. The list goes on!
    
    addiefoote8 - 15 hours ago
    
    I'd also add training checkpoints to the list for active transparency. I think the Olmo models do a decent job, but it would be cool to see it for bigger models and for ones that are closer to state-of-the-art in terms of both architecture and algorithms.
  - kazinator - 11 hours ago
    
    Security audits, etc, are possible because binary code closely implements what the source code says.
    In this case, you have no idea what the weights are going to "do", from looking at the source materials --- the training data and algorithm --- without running the training on the data.
- oscarmoxon - 15 hours ago
  
  Compute costs are falling fast, training is getting cheaper. GPT-2 costs pocket change to train, and now it costs pocket train to tune >1T parameter models. If it was transparent what costs went into the weights, they could be commodified and stripped of bloat. Instead the hidden cost is building the infrastructure that was never tested at scale by anyone other than the original developers who shipped no documentation of where it fails. Unlike compute, this hidden cost doesn't commodify on its own.
- - 15 hours ago
  
  [deleted]
- addiefoote8 - 15 hours ago
  
  yeah, the costs are definitely a factor and prohibitive in completely replicating an open source model. Still, there's a lot of useful things that can be done cheaply, including fine tuning, interpretability work, and other deeper investigations into the model that can't happen without the infrastructure.
anon373839 - 4 hours ago

> "Open weights" borrows the legitimacy of open source
I don't really see how open-weights models need to borrow any legitimacy. They are valuable artifacts being given away that can be used, tested and repurposed forever. Fully open models like the OLMo series and Nvidia's Nemotron are much more valuable in some contexts, but they haven't quite cracked the level of performance that the best open-weights models are hitting. And I think that's why most startups are reaching for Chinese base LLMs when they want to tune custom models: the performance is better and they were never going to bother with pretraining anyway.
maxwg - 10 hours ago

The training methods are largely published in their open research papers - though arguably some open weight companies are less open with the exact details.
Realistically a model will never be "compiled" 1:1. Copyrighted data is almost certainly used and even _if_ one could somehow download the petabytes of training data - it's quite likely the model would come out differently.
The article seems to be talking more about the difficulties of fine tuning models though - a setup problem that likely exists in all research, and many larger OSS projects that get more complicated.
- alansaber - 10 hours ago
  
  Yes the issue is they can embelish the shit out of the papers b/c we only see the final result

2001zhaozhao - 3 hours ago

Open-weight AI is actually analogous to closed source, free shareware you can decompile and modify yourself and run on your computer or a cloud server of your choice.

It's a clear distinction to proprietary AI, which is analogous to SaaS software controlled by a company that runs it on its own cloud, and owns your data.

But it's still not open source.

throwaway2037 - an hour ago

The picture on that blog post is very cool. It gives Hetch Hetchy vibes. Is there software to convert a photo into ASCII art?

mnkv - 13 hours ago

This blog post describes the basic work of a research engineer and nothing more. The amount of surprise the author has seems to suggest they haven't really worked in ML for very long.

Honestly? This is the best its ever been. Getting stuff to run before huggingface and uv and docker containers with cuda was way worse. Even with full open-source, go try to run a 3+ years old model and codebase. The field just moves very fast.

est - 4 hours ago

Even if you have open training, the corpus were compiled from millions of sources, labeled by experts manually, by AI, by outsourcing, etc. Is it "open by first principle" ?

timmg - 14 hours ago

Somewhat orthogonal but: when do we expect "volunteer" groups to provide training data for LLMs for [edit: free] for (like) hobbyist kinds of things? (Or do we?)

Like wikipedia probably provides a significant amount of training for LLMs. And that is volunteer and free. (And I love the idea of it.)

But I can imagine (for example) board game enthusiasts to maybe want to have training data for games they love. Not just rules but strategies.

Or, really, any other kind of hobby.

That stuff (I guess) gets in training data by virtue of being on chat groups, etc. But I feel like an organized system (like wikipedia) would be much better.

And if these sets were available, I would expect the foundation model trainers would love to include it. And the results would be better models for those very enthusiasts.

oscarmoxon - 14 hours ago

Some of this exists already in pockets (Common Crawl, The Pile, RedPajama are all volunteer/open efforts). I suppose there's no equivalent of the "edit this page and see the impact" like with have with Wikipedia. Contributing to an open dataset has no feedback loop if the training infrastructure that would consume it is closed... seems like a feedback problem.
djoldman - 13 hours ago

https://arxiv.org/abs/2304.07327

mirekrusin - 12 hours ago

Isn't LoRA solved problem by unsloth?

addiefoote8 - 11 hours ago

Unsloth doesn't support distributed training well and doesn't support Kimi models.
- alansaber - 10 hours ago
  
  Distributed training has been on their to do list for a good long while iirc
alansaber - 10 hours ago

Model compression more generally is far from a solved field

asah - 10 hours ago

what about distillation methods ?

cat_plus_plus - 10 hours ago

Well, it's open training in the sense that the code is open source and you are free to fix it so it trains successfully. That's consistent with how open source works generally. In my experience unsloth is where new model training is usually fixed first.

mschuster91 - 15 hours ago

"open training" is something that won't ever happen for large scale models. For one, probably everyone's training datasets include large amount of questionable material: copyrighted media first and foremost (court cases have shown that AI models can regurgitate entire books almost verbatim), but also AI slop contaminating the dataset, or on the extreme end CSAM - for Grok to know how the intimate bits of children look like (which is what was shown during the time anyone could prompt it with "show her in a bikini") it obviously has to have ingested CSAM during training.

And then, a ton of training still depends on human labor - even at $2/h in exploitative bodyshops in Kenya [1], that still adds up to a significant financial investment in training datasets. And image training datasets are expensive to train as well - Google's reCAPTCHA used millions of hours of humans classifying which squares contained objects like cars or motorcycles.

[1] https://time.com/6247678/openai-chatgpt-kenya-workers/

hananova - 14 hours ago

I’m not convinced that Grok’s dataset must contain CSAM for it to generate CSAM. Surely a combination of nude adults and clothed children would allow for it to synthesize CSAM?
(Disclaimer: I’m not in favor of AI in general and definitely not in favor of what Grok is doing specifically. I’m just entirely sold on the claim that its dataset must contain CSAM, though I think it is probably likely that it has at least some, because cleaning up such a massive dataset carefully and thoroughly costs money that Elon wouldn’t want to spend.)
- fragmede - 3 hours ago
  
  How many llamas riding motorcycles are in the dataset for it to be able to generate images for that? Why does that not extend to CSAM?
iamcreasy - 12 hours ago

> "open training" is something that won't ever happen for large scale models
https://www.swiss-ai.org/apertus
Source: EPFL, ETH Zurich, and the Swiss National Supercomputing Centre (CSCS) has released Apertus, Switzerland’s first large-scale open, multilingual language model — a milestone in generative AI for transparency and diversity. Trained on 15 trillion tokens across more than 1,000 languages – 40% of the data is non-English – Apertus includes many languages that have so far been underrepresented in LLMs, such as Swiss German, Romansh, and many others. Apertus serves as a building block for developers and organizations for future applications such as chatbots, translation systems, or educational tools. The model is named Apertus – Latin for “open” – highlighting its distinctive feature: the entire development process, including its architecture, model weights, and training data and recipes, is openly accessible and fully documented.
- mschuster91 - 10 hours ago
  
  I wasn't aware of that one, thanks.
  Should have been more clear in my wording though - I was referring to commercially useful models.
pfortuny - 14 hours ago

The human labor aspect is very little discussed and essential and very abusive, I am sure.
People think of these models as "magic" and "science" but they do not realize the immense amount (in human years) of clicking yes/no in front of thousands of pairs of input/outputs.
I worked for some months as a Google Quality Rater (wow), and know the job. This must be much worse.
oscarmoxon - 14 hours ago

Agree that this makes it unlikely we see frontier training data OS'd but this is a separate problem from software and infrastructure transparency, which has none of those constraints. Training stack, the parallelism decisions, documented failure modes are engineering knowledge and there's no principled reason it doesn't ship.
addiefoote8 - 15 hours ago

I agree full transparency on data adds several other challenges. Still, even releasing the software and infrastructure aspects would be a huge step from where we are now. Also, some recent work has shown pretraining filtering to be possible and beneficial which could help mitigate some concerns of sensitive data in the datasets.