Complete silence is always hallucinated as "ترجمة نانسي قنقر" in Arabic
github.com | 510 points by edent 15 hours ago
The same happens with whisper-large-v3 on Chinese transcription: silence is transcribed as something like "please upvote, share and favourite this video". I suspect they trained the model on random YouTube videos without carefully selecting useful data.
In Chinese, it always added something like "For study/research purpose only. Please delete after 48 hours." This is what volunteer translators add to the subtitles of (pirated) movies and shows.
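For what it's worth, one stopgap for the symptom described above is to post-filter transcript segments against a blocklist of known subtitle-credit phrases. This is only a minimal sketch, not anything Whisper itself provides: the pattern list is assembled from the phrases quoted in this thread, and the function names (`is_probable_credit`, `drop_credit_segments`) are hypothetical.

```python
import re

# Hypothetical blocklist, built only from phrases quoted in this thread.
# A real deployment would need a longer, per-language list.
KNOWN_CREDIT_PATTERNS = [
    r"ترجمة نانسي قنقر",
    r"please upvote, share and favou?rite this video",
    r"for study/research purposes? only",
    r"please delete after 48 hours",
    r"subtitles by \S+",
]
_CREDIT_RE = re.compile("|".join(KNOWN_CREDIT_PATTERNS), re.IGNORECASE)


def is_probable_credit(text: str) -> bool:
    """Return True if the text contains a known subtitle-credit phrase."""
    return bool(_CREDIT_RE.search(text))


def drop_credit_segments(segments: list[str]) -> list[str]:
    """Keep only the segments that do not look like subtitle credits."""
    return [s for s in segments if not is_probable_credit(s)]
```

This only hides the output; it does nothing about the training data, and it would also drop a genuine utterance that happens to contain one of these phrases.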
Fair. If AI companies are allowed to download pirated content for "learning", why can't ordinary people?
There is so much damning evidence that AI companies have committed absolutely shocking amounts of piracy, yet nothing is being done.
It only highlights how the world really works. If you have money you get to do whatever the fuck you want. If you're just a normal person you get to spend years in jail or worse.
Reminds me of https://www.youtube.com/watch?v=8GptobqPsvg
The dead corpses of filmmakers and authors and actors are buried in unmarked graves out behind those companies' corporate headquarters. Unimaginable horror, that piracy. Why has no one intervened?
>If you're just a normal person you get to spend years in jail or worse.
Not that I'm a big fan of the criminalization of copyright infringement in the United States, but who has ever spent years in jail for this?
Besides, if it really bothered you, then we might not see this weird tone-switch from one sentence to the next, where you seem to think that piracy is shocking and "something should be done" and then "it's not good that someone should spend time in jail for it". What gives?
> Besides, if it really bothered you, then we might not see this weird tone-switch from one sentence to the next, where you seem to think that piracy is shocking and "something should be done" and then "it's not good that someone should spend time in jail for it". What gives?
What a weirdly condescending way to interpret my post. My point boils down to: Either prosecute copyright infringement or don't. The current status quo of individuals getting their lives ruined while companies get to make billions is disgusting.
> Either prosecute copyright infringement or don't
This is the absolute core of the issue. Technical people see law as code, where context can be disregarded and all that matters is specifying the outputs for a given set of inputs.
But law doesn’t work that way, and it should not work that way. Context matters, and it needs to.
If you go down the road of “the law is the law and billion-dollar companies working on products should be treated the same as individual consumers”, it follows that individuals should do SEC filings (“either require 10-Qs or don’t!”), and surgeons should be jailed (“either prosecute cutting people with knives or don’t!”).
There is a lot to dislike about AI companies, and while I believe that training models is transformative, I don’t believe that maintaining libraries of pirated content is OK just because it’s an ingredient to training.
But insisting that individual piracy to enjoy entertainment without paying must be treated exactly the same as datasets for model training is the absolute weakest possible argument here. The law is not that reductive.
There's actually a lot of court activity on this topic, but the law moves slowly and is reluctant to issue injunctions where harm is not obvious.
It's more that the law about "one guy decides to pirate twelve movies to watch them at home and share with his buddies" is already well-settled, but the law about "a company pirates 10,000,000 pieces to use as training data for an AI model (a practice that the law already says is legal in an academic setting, i.e. universities do this all the time and nobody bats an eye)" is more complicated and requires additional trials to resolve. And no, even though the right answer may be self-evident to you or me, it's not settled law, and if the force of law is applied poorly suddenly what the universities are doing runs afoul of it and basically nobody wants that outcome.
There is a distinction that must be made that very few people make, but thankfully the courts seem to grasp:
Training on copyrighted material is a separate claim from skirting payment for copyrighted material.
Which pretty much boils down to: "If they put it out there for everyone to see, it's probably OK to train on it. If they put it behind a paywall and you don't pay, the training part doesn't matter; it's a violation."
Whether it’s legal slash fair use to train on copyrighted material is only one of the questions currently being asked though. There’s a separate issue at play where these companies are pirating the material for the training process.
By comparison, someone here brought up that it might be transformative fair use to write a play heavily based on Blood Meridian, but you still need to buy a copy of the book. It would still be infringement to pirate the e-book for your writing process, even if the end result was legal.
If they bought material at scale, the seller might require them to sign a contract that demands royalties if the material is used for training an AI. So buying legally is a way to walk into a trap.
They can buy individual works like anyone else.
Or they can negotiate a deal at scale with whatever price / restrictions make sense to both parties.
I don’t see a way they could be “trapped”. Worst case they pay retail price.
What is the precedent on that kind of agreement?
The only thing I've been able to find is the note that since copyright is federal law, state contract law actually can't supersede it, to wit: if you try to put a clause in the contract that says the contract is void if I use your work to make transformative fair-use works (or I owe you a fee), that clause is functionally unenforceable (for the same reason that I don't owe you a fee if I make transformative fair-use works of your creations in general).
So if I download copyrighted material, like the new Disney movie with fansubs, and watch it for training purposes instead of enjoyment purposes, it's fine? In that case I've just been training myself, your honor. No, no, I'm not enjoying these TV shows.
Because it's important to grasp the scale of these copyright violations:
* They downloaded, and admitted to using, Anna's Archive: millions of books and papers, most of which are paywalled, which they pirated instead of paying for
* They acquired movies and TV shows and used unofficial subtitles distributed by websites such as OpenSubtitles, which are typically used for pirated media. Official releases such as DVDs tend to have official subtitles that don't sign off with "For study/research purpose only. Please delete after 48 hours" or "Subtitles by %some_username%"
I don't know what is confusing here, perhaps my comment isn't clear.
If you skirt payment, it's a violation. If it's free, but still copyrighted, it's likely not a violation.
They've done both, so my confusion is about why you are bringing this up.
If you owe the bank $1,000 you have a problem.
If you owe the bank $100,000,000 the bank has a problem.
We live in an era where the president of the United States uses his position to pump crypto scams purely for personal profit.
No one (in the US) has been jailed for downloading copyrighted material.
https://en.wikipedia.org/wiki/Aaron_Swartz
And the US is not the only jurisdiction
That's not the same as piracy though. He wasn't downloading millions of scientific papers from libgen or sci-hub; he was downloading them directly from JSTOR. Indeed, none of his charges were for copyright infringement. They were for stuff like "breaking and entering" and "unauthorized access to a computer network".
The exact same charges could apply to the AI scrapers illegitimately accessing random websites.
No, they couldn't, since the then-novel and untested strained interpretation of the CFAA that the prosecutor was relying on has since been tested in the courts and soundly rejected.
I haven’t seen any accusations that they’ve done that, though. Usually people get pirated material from sources that intentionally share pirated material.