Sopro TTS: A 169M model with zero-shot voice cloning that runs on the CPU

github.com

330 points by sammyyyyyyy a day ago


realityfactchex - a day ago

That's cool and useful.

IMO, the best alternative is Chatterbox-TTS-Server [0] (slower, but quite high quality).

[0] https://github.com/devnen/Chatterbox-TTS-Server

armcat - 9 hours ago

Super nice! I've been using Kokoro locally, which is 82M parameters and runs (and sounds) amazing! https://huggingface.co/hexgrad/Kokoro-82M

VerifiedReports - 14 hours ago

What is "zero-shot" supposed to mean?

blitzar - a day ago

Mission impossible cloning skills without the long compile time.

"The pleasure of Buzby's company is what I most enjoy. He put a tack on Miss Yancy's chair ..."

https://www.youtube.com/watch?v=H2kIN9PgvNo

https://literalminded.wordpress.com/2006/05/05/a-panphonic-p...

yamal4321 - 17 hours ago

Tried English. There are similarities. Really impressive for such a budget. Also incredibly easy to use, thanks for this.

btbuildem - a day ago

It's impressive given the constraints!

Would you consider releasing a more capable version that renders with fewer artifacts (and maybe requires a bit more processing power)?

Chatterbox is my go-to; this could be a nice alternative if it were capable of high-fidelity results!

SoftTalker - a day ago

What does "zero-shot" mean in this context?

guerrilla - 19 hours ago

I don't understand the comments here at all. I played the audio and it sounds absolutely horrible, far worse than computer voices sounded fifteen years ago. Not even the most feeble-minded person would mistake that for a human. Am I not hearing the same thing everyone else is hearing? It sounds straight-up corrupted to me. Tested in different browsers, no difference.

LoveMortuus - 10 hours ago

This is very cool! And it'll only get better. I do wonder whether, at least as a patch-up job, they could do some light audio processing to remove the raspiness from the voices.

derefr - 21 hours ago

Is there yet any model like this, but which works as a "speech plus speech to speech" voice modulator — i.e. taking a fixed audio sample (the prompt), plus a continuous audio stream (the input), and transforming any speech component of the input to have the tone and timbre of the voice in the prompt, resulting in a continuous audio output stream? (Ideally, while passing through non-speech parts of the input audio stream; but those could also be handled other ways, with traditional source separation techniques, microphone arrays, etc.)

Though I suppose, for the use-case I'm thinking of (v-tubers), you don't really need the ability to dynamically change the prompt; so you could also simplify this to a continuous single-stream "speech to speech" model, which gets its target vocal timbre burned into it during an expensive (but one-time) fine-tuning step.

krunck - 18 hours ago

I just had some amusing results using text with lots of exclamations and turning up the temperature. Good fun.
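
For anyone wondering why a higher temperature makes the output wilder, here is a minimal sketch. It assumes "temperature" means the usual sampling temperature applied to the model's logits before softmax; the function name and the logit values are made up for illustration and are not taken from Sopro.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical logits for three candidate audio tokens.
logits = [2.0, 1.0, 0.0]
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature rises, the top token loses probability mass to the unlikely ones, so sampling picks odd tokens more often.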

woodson - 21 hours ago

Does the 169M include the ~90M params for the Mimi codec? Interesting approach using FiLM for speaker conditioning.
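
For readers who haven't met FiLM (feature-wise linear modulation): the idea is to predict a per-channel scale and shift from a conditioning vector and apply them to intermediate features. A minimal pure-Python sketch follows, assuming the conditioning vector is a speaker embedding; every name, shape, and weight here is hypothetical and not taken from Sopro's code.

```python
def linear(x, weight, bias):
    """Dense layer: weight holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def film(features, speaker_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each channel of every frame by values predicted from the embedding."""
    gamma = linear(speaker_embedding, w_gamma, b_gamma)  # per-channel scale
    beta = linear(speaker_embedding, w_beta, b_beta)     # per-channel shift
    return [[g * x + b for g, x, b in zip(gamma, frame, beta)] for frame in features]

# Toy example: 2 frames with 2 channels each, and a 2-dimensional speaker embedding.
features = [[1.0, 1.0], [2.0, 3.0]]
speaker_embedding = [1.0, 2.0]
w_gamma, b_gamma = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w_beta, b_beta = [[0.0, 0.0], [0.0, 0.0]], [0.5, -0.5]
print(film(features, speaker_embedding, w_gamma, b_gamma, w_beta, b_beta))
```

The same features come out differently for each speaker embedding, which is how one network can be steered toward a reference voice without retraining.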

convivialdingo - a day ago

Impressive! The cloning and voice affect is great. Has a slight warble in the voice on long vowels, but not a huge issue. I'll definitely check it out - we could use voice generation for alerting on one of our projects (no GPUs on hardware).

lukebechtel - a day ago

Very cool. I'd love a slightly larger version with hopefully improved voice quality.

Nice work!

elaus - a day ago

Very nice to have done this by yourself, locally.

I wish there were an open/local TTS model with voice cloning as good as 11l (for non-English languages even).

jacquesm - 21 hours ago

What could possibly go wrong...

Don't you ever think about what the balance of good and bad is when you make something like this? What's the upside? What's the downside?

In this particular case I can only see downsides, if there are upsides I'd love to hear about them. All I see is my elderly family members getting 'me' on their phones asking for help, and falling for it.

I've gotten into the habit of waiting for the other person to speak first when I answer the phone now and the number is unknown to me.

jokethrowaway - 8 hours ago

I'm sure it has its uses, but for anything with a higher requirement for quality, I think Vibe Voice is the only real OSS cloning option.

F2/E5 are also very good but have plenty of bad runs; you need to keep re-rolling until you get good outputs.

Gathering6678 - 18 hours ago

Emm...I played the sample audio and it was...horrible?

How is it voice cloning if even the sample doesn't sound like any human being...

sergiotapia - 20 hours ago

It sounds a lot like RFK Jr! Does anyone have any more casual examples?

jokethrowaway - 8 hours ago

Sorry but the quality is too bad.

I'm sure it has its uses, but for anything practical I think Vibe Voice is the only real OSS cloning option. F2/E5 are also very good but have plenty of bad runs; you need to keep re-rolling.

nunobrito - 21 hours ago

Very cool. Now the next challenge (for me) is how to convert this to Dart and run it on Android. :-)

brikym - 21 hours ago

A scammer's dream.