Open Weights isn't Open Training

workshoplabs.ai

103 points by addiefoote8 a day ago


oscarmoxon - a day ago

The framing here is undersold in the broader discourse: "open weights" trades on the promise of reproducibility without delivering it. What you have is closer to a compiled binary than to source code. You can run it, and you can diff it against other binaries, but you cannot, in any meaningful sense, reproduce or extend it from first principles.

This matters because OSS truly depends on the reproducibility claim. "Open weights" borrows the legitimacy of open source (the assumption that scrutiny is possible, that no single actor has a moat, that iteration is democratised). Truly democratised iteration would crack open the training stack and let you generate intelligence from scratch.

Huge kudos to Addie and the team for this :)

2001zhaozhao - 3 hours ago

Open-weight AI is actually analogous to closed-source freeware: software you can decompile, modify yourself, and run on your own computer or on a cloud server of your choice.

That's a clear distinction from proprietary AI, which is analogous to SaaS software controlled by a company that runs it on its own cloud and owns your data.

But it's still not open source.

throwaway2037 - an hour ago

The picture on that blog post is very cool. It gives Hetch Hetchy vibes. Is there software to convert a photo into ASCII art?
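
(Yes: ready-made tools exist, jp2a on the command line for instance, and the core mapping is simple enough to sketch in a few lines of Python. The sketch below assumes Pillow is installed; "photo.jpg" is a placeholder filename.)

```python
# A minimal sketch, assuming Pillow ("pip install Pillow");
# "photo.jpg" is a placeholder, not a real file from the post.
from PIL import Image

# Ramp from sparse to dense glyphs; brighter pixels get denser characters.
CHARS = " .:-=+*#%@"

def to_ascii(path: str, width: int = 80) -> str:
    img = Image.open(path).convert("L")  # grayscale
    # Terminal cells are roughly twice as tall as wide, so halve the height.
    height = max(1, int(img.height * width / img.width * 0.5))
    img = img.resize((width, height))
    pixels = list(img.getdata())
    rows = []
    for y in range(height):
        row = pixels[y * width:(y + 1) * width]
        rows.append("".join(CHARS[p * (len(CHARS) - 1) // 255] for p in row))
    return "\n".join(rows)

print(to_ascii("photo.jpg"))
```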

mnkv - 13 hours ago

This blog post describes the basic work of a research engineer and nothing more. The author's level of surprise suggests they haven't really worked in ML for very long.

Honestly? This is the best it's ever been. Getting stuff to run before huggingface, uv, and Docker containers with CUDA was way worse. Even with full open source, go try to run a 3+ year old model and codebase. The field just moves very fast.
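
To illustrate how low the bar is now: running an open-weight model is a few lines, assuming the transformers library is installed. "gpt2" here is just a stand-in for any small open model.

```python
# A minimal sketch of today's baseline, assuming transformers is
# installed; "gpt2" is a stand-in for any small open-weight model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Open weights are", max_new_tokens=20)[0]["generated_text"])
```

Before this tooling existed, the same task meant pinning CUDA versions and vendoring model code by hand.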

est - 4 hours ago

Even if you have open training, the corpus was compiled from millions of sources and labeled manually by experts, by AI, by outsourced workers, etc. Is that "open by first principles"?

timmg - 14 hours ago

Somewhat orthogonal, but: when do we expect "volunteer" groups to provide training data for LLMs [edit: for free] for hobbyist kinds of things? (Or do we?)

Wikipedia, for example, probably provides a significant amount of training data for LLMs. And that is volunteer-made and free. (And I love the idea of it.)

But I can imagine (for example) board game enthusiasts wanting to have training data for the games they love. Not just the rules, but strategies.

Or, really, any other kind of hobby.

That stuff (I guess) gets into training data by virtue of being in chat groups, etc. But I feel like an organized system (like Wikipedia) would be much better.

And if these sets were available, I would expect the foundation model trainers would love to include them. And the result would be better models for those very enthusiasts.

mirekrusin - 12 hours ago

Isn't LoRA a solved problem thanks to unsloth?
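
(For context: LoRA freezes the base weights and trains small low-rank adapters on top, which is the recipe unsloth optimizes. A minimal sketch with Hugging Face's peft library; the model name and hyperparameters are illustrative assumptions, not a recommendation.)

```python
# A minimal LoRA sketch using peft and transformers; model name
# and hyperparameters are illustrative, not a tuned recipe.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapters are trainable
```

With settings like these, typically well under 1% of the parameters end up trainable; the base model itself is untouched.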

asah - 10 hours ago

What about distillation methods?
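
(For anyone unfamiliar: the classic distillation setup trains a small student to match a larger teacher's softened output distribution. A minimal sketch of that loss, assuming PyTorch; the temperature value is an illustrative choice.)

```python
# A minimal sketch of the classic soft-label distillation loss,
# assuming PyTorch; logits are (batch, classes) tensors.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    # Soften both distributions with temperature T, then minimize KL.
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable to a
    # standard cross-entropy term when the two losses are mixed.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```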

cat_plus_plus - 10 hours ago

Well, it's open training in the sense that the code is open source and you are free to fix it so it trains successfully. That's consistent with how open source works generally. In my experience unsloth is where new model training is usually fixed first.

mschuster91 - 15 hours ago

"open training" is something that won't ever happen for large scale models. For one, probably everyone's training datasets include large amount of questionable material: copyrighted media first and foremost (court cases have shown that AI models can regurgitate entire books almost verbatim), but also AI slop contaminating the dataset, or on the extreme end CSAM - for Grok to know how the intimate bits of children look like (which is what was shown during the time anyone could prompt it with "show her in a bikini") it obviously has to have ingested CSAM during training.

And then, a ton of training still depends on human labor - even at $2/h in exploitative bodyshops in Kenya [1], that still adds up to a significant financial investment in training datasets. And image training datasets are expensive to build as well - Google's reCAPTCHA used millions of hours of human effort classifying which squares contained objects like cars or motorcycles.

[1] https://time.com/6247678/openai-chatgpt-kenya-workers/