Chatterbox TTS
github.com | 663 points by pinter69 5 days ago
Demos here: https://resemble-ai.github.io/chatterbox_demopage/ (not mine)
This is a good release, if the demos aren't too cherry-picked!
I say this every time it comes up, and it's not as sexy to work on, but in my experiments voice AI is really held back by transcription, not TTS. Unless that's changed recently.
FWIW in my recent experience I've found LLMs are very good at reading through the transcription errors
(I've yet to experiment with giving the LLM alternate transcriptions or confidence levels, but I bet they could make good use of that too)
Pairing speech recognition with an LLM acting as a post-processor is a pretty good approach.
I put together a script a while back which takes any audio file (wav, mp3, etc.), normalizes the audio, passes it to ggerganov's whisper.cpp for transcription, and then forwards the result to an LLM to clean up the text. I've used it with a pretty high rate of success on some of my very old and poorly recorded voice dictation recordings from over a decade ago.
Public gist in case anyone finds it useful:
https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
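For anyone who doesn't want to click through, the core of it looks roughly like this (a minimal sketch, not the actual gist: I'm assuming ffmpeg, the openai-whisper package, and a local Ollama endpoint for the cleanup step; the model names and the prompt are placeholders):

  import json, subprocess, urllib.request
  import whisper  # pip install openai-whisper

  def transcribe_and_clean(audio_path):
      # Normalize to 16 kHz mono WAV first.
      subprocess.run(["ffmpeg", "-y", "-i", audio_path, "-ar", "16000",
                      "-ac", "1", "normalized.wav"], check=True)

      # Rough transcription with whisper.
      raw = whisper.load_model("base.en").transcribe("normalized.wav")["text"]

      # Hand the raw transcript to a local LLM for cleanup.
      prompt = ("The following is a noisy speech-to-text transcript. Fix obvious "
                "transcription errors and punctuation, but do not add or remove "
                "content:\n\n" + raw)
      req = urllib.request.Request(
          "http://localhost:11434/api/generate",
          data=json.dumps({"model": "llama3", "prompt": prompt,
                           "stream": False}).encode(),
          headers={"Content-Type": "application/json"})
      with urllib.request.urlopen(req) as resp:
          return json.loads(resp.read())["response"]

  print(transcribe_and_clean("old_dictation.mp3"))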
An LLM step also works pretty well for diarization. You get a transcript with speaker segmentation (with whisper and pyannote, for example), SPEAKER_01 says at some point "Hi, I'm Bob. And here's Alice", SPEAKER_02 says "Hi Bob", and now the LLM can infer that SPEAKER_01 = Bob and SPEAKER_02 = Alice.
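The inference step is basically just one prompt; something like this sketch (the labels and wording are made up, and the LLM call itself is whatever endpoint you already use):

  diarized = ("SPEAKER_01: Hi, I'm Bob. And here's Alice.\n"
              "SPEAKER_02: Hi Bob.\n"
              "SPEAKER_01: Let's get started.")

  prompt = (
      "Below is a speaker-segmented transcript. Infer each speaker's real name "
      "from context and answer with JSON mapping labels to names, e.g. "
      '{"SPEAKER_01": "Bob"}. Use null for speakers whose name is never said.\n\n'
      + diarized)
  # Send `prompt` to your LLM, parse the JSON answer, then string-replace the
  # labels in the transcript with the inferred names.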
Yep, my agent I built years ago worked very well with this approach, using a whisper-pyannote combo. The fun part is knowing when to end transcription in noisy environments like a coffee shop.
thanks for sharing. are some local models better than others? can small models work well or do you want 8B+?
So in my experience smaller models tend to produce worse results, BUT I actually got really good transcription cleanup with CoT (Chain of Thought) models like Qwen, even quantized down to 8b.
I think the 8B+ question was about parameter count (8 billion+ parameters), not quantization level (8 bits per weight).
Yeah, I should have been more specific - Qwen 8B at a Q5_K_M quant worked very well.
I was going to say, ideally you’d be able to funnel alternates to the LLM, because it would be vastly better equipped to judge what is a reasonable next word than a purely phonetic model.
If you just give the transcript, and tell the LLM it is a voice transcript with possible errors, then it actually does a great job in most cases. I mostly have problems with mistranscriptions saying something entirely plausible but not at all what I said. Because the STT engine is trying to make a semantically valid transcription it often produces grammatically correct, semantically plausible, and incorrect transcriptions. These really foil the LLM.
Even if you can just mark the text as suspicious I think in an interactive application this would give the LLM enough information to confirm what you were saying when a really critical piece of text is low confidence. The LLM doesn't just know what are the most plausible words and phrases for the user to say, but the LLM can also evaluate if the overall gist is high or low confidence, and if the resulting action is high or low risk.
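If the engine gives you word-level confidences, the marking itself can be dead simple; a sketch (the words and scores here are made up, and the exact field names depend on your STT engine):

  # Flag low-confidence words inline so the LLM can see where the transcript
  # is shaky. In practice the (word, confidence) pairs come from your STT
  # engine's word-level output.
  words = [("turn", 0.97), ("off", 0.95), ("the", 0.99), ("heater", 0.41)]

  def mark_suspicious(pairs, threshold=0.6):
      return " ".join(f"[{w}?]" if c < threshold else w for w, c in pairs)

  print(mark_suspicious(words))  # -> turn off the [heater?]
  # The system prompt then tells the LLM that bracketed words are low
  # confidence and should be confirmed before doing anything risky.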
This is actually something people used to do.
Old ASR systems (even models like Wav2vec) were usually combined with a language model. It wasn't a large language model (those didn't exist at the time); it was usually something based on n-grams.
Do you know if any current locally hostable public transcribers are good at diarization? For some tasks, having even crude diarization would improve QOL by a huge factor. I was looking at a whisper-diarization Python package for a bit, but it was a bitch to deploy.
Yeah, as I said, I couldn't figure out how to deploy whisper-diarization.
So you need Python (a full install) and git; the OS doesn't matter. A Python venv (virtual environment) ensures that this folder, once it works, is pinned to all the versions inside it, including the Python version. This works for any software that uses pip to set up, or any Python stuff in general.
git clone <whisper-diarization.git URL>
cd whisper-diarization
python -m venv .
# activate the venv; depending on your OS that's `source ./bin/activate`, `.\Scripts\activate.ps1`, etc. [0]
Your prompt should change to say (whisper-diarization) <your OS prompt>$. Now you can type:
pip install -c constraints.txt -r requirements.txt
python ./diarize.py --no-stem --suppress_numerals --whisper-model large-v3-turbo --device cuda -a <FILE>
Next time you want to use it, you can just do:
cd ~/whisper-diarization
source ./bin/activate (or the Windows equivalent) [0]
python ./diarize.py [...]
[0] To activate a Python virtual environment created with venv, run `source <venv>/bin/activate` on Linux or macOS, or `<venv>\Scripts\activate` on Windows; this will change your terminal prompt to indicate that the virtual environment is active. (That note was 'AI generated' by DDG, but whatever; since we created the venv in the repo folder itself, Linux puts it in ./bin/activate and Windows puts it in ./Scripts/activate.ps1, ideally.)
For English-only and non-commercial use, Parakeet has been almost flawless for me.
https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
I use it for real-time chat and generating subtitles. It can do a tv show in less than a minute on a 3090.
Whisper always hallucinated too much for me. It's more useful as a classifier.
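If anyone wants to try Parakeet locally, usage through NeMo is roughly the below; I'm going from memory of the model card, so double-check the exact API there:

  # Rough sketch of running Parakeet via the NeMo toolkit.
  import nemo.collections.asr as nemo_asr

  asr_model = nemo_asr.models.ASRModel.from_pretrained(
      model_name="nvidia/parakeet-tdt-0.6b-v2")
  # transcribe() takes a list of audio file paths (16 kHz mono WAV is safest).
  results = asr_model.transcribe(["some_audio.wav"])
  print(results[0])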
It would be nice if there were some type of front-end integration that presented the user with a list of heteronyms found in the text and asked for clarification for each one, as well as having lists of common phrases to compare them against. There's really no excuse for an LLM to mispronounce "live feed" or "live here".
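A first pass at that front-end could literally just be a lookup against a hand-maintained heteronym list; a tiny illustrative sketch:

  import re

  # Scan input text for words with more than one common pronunciation and
  # surface each occurrence so the user can be asked which reading they want.
  HETERONYMS = {"live", "read", "lead", "bass", "tear", "wind", "bow", "wound"}

  def find_heteronyms(text):
      return [(m.start(), m.group(0)) for m in re.finditer(r"[A-Za-z']+", text)
              if m.group(0).lower() in HETERONYMS]

  print(find_heteronyms("We watched the live feed while birds live here."))
  # -> [(15, 'live'), (37, 'live')]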
Right you are. I've used speechmatics, they do a decent jon with transcription
1 error every 78 characters?
The way to measure transcription accuracy is word error rate, not character error rate. I haven't really checked (or trusted) Speechmatics' accuracy benchmarks. But from my experience and personal impression it looks good; I haven't done a quantitative benchmark.
Thanks for your constructive reply on my bad joke. I was referring to your original comment where you had a typo. I just couldn't resist, sorry.
Played with the Huggingface demo, and I'm guessing this page is a little cherry-picked? In particular I am not getting that kind of emotion in my responses.
It is hard to get consistent emotion with this. There are some parameters, and you can go a bit crazy, but it gets weird…
I absolutely ADORE that this has swearing directly in the demo. And from Pulp Fiction, too!
> Any of you fucking pricks move and I'll execute every motherfucking last one of you.
I'm so tired of the boring old "miss daisy" demos.
People in the indie TTS community often use the Navy Seals copypasta [1, 2]. It's refreshing to see Resemble using swear words themselves.
They know how this will be used.
Heh, I always type out the first sentence or two of the Navy Seal copypasta when trying out keyboards.
You can run it for free here: https://huggingface.co/spaces/ResembleAI/Chatterbox
Sadly they don't publish any training or fine tuning code, so this isn't "open" in the way that Flux or Stable Diffusion are "open".
If you want better "open" models, these all sound better for zero shot:
Zeroshot TTS: MaskGCT, MegaTTS3
Zeroshot VC: Seed-VC, MegaTTS3
Granted, only Seed-VC has training/fine tuning code, but all of these models sound better than Chatterbox. So if you're going to deal with something you can't fine tune and you need a better zero shot fit to your voice, use one of these models instead. (Especially ByteDance's MegaTTS3. ByteDance research runs circles around most TTS research teams except for ElevenLabs. They've got way more money and PhD researchers than the smaller labs, plus a copious amount of training data.)
But what's the inference speed like on these? Can you use them in a realtime interaction with an agent?
Fun to play with.
It makes my Australian accent sound very English though, in a posh RP way.
Very natural sounding, but not at all recreating my accent.
Still, amazingly clear and perfect for most TTS uses where you aren't actually impersonating anyone.
A bit on the nose that they used a sample from a professional voice actor (Jennifer English) as the default reference audio file in that huggingface tool.
How does it work from the privacy standpoint? Can they use recorded samples for training?
Chatterbox is fantastic.
I created an API wrapper that also makes installation easier (Dockerized as well) https://github.com/travisvn/chatterbox-tts-api/
Best voice cloning option available locally by far, in my experience.
> Chatterbox is fantastic.
> I created an API wrapper that also makes installation easier (Dockerized as well) https://github.com/travisvn/chatterbox-tts-api/
Gave your wrapper a try and, wow, I'm blown away by both Chatterbox TTS and your API wrapper.
Excuse the rudimentary level of what follows.
Was looking for a quick and dirty CLI incantation to specify a local text file instead of the inline `input` object, but couldn't figure it out.
Pointers much appreciated.
This API wrapper was initially made to support a particular use case where someone's running, say, Open WebUI or AnythingLLM or some other local LLM frontend.
A lot of these frontends have an option for using OpenAI's TTS API, and some of them allow you to specify the URL for that endpoint, allowing for "drop-in replacements" like this project.
So the speech generation endpoint in the API is designed to fill that niche. However, its usage is pretty basic and there are curl statements in the README for testing your setup.
Anyway, to get to your actual question, let me see if I can whip something up. I'll edit this comment with the command if I can swing it.
In the meantime, can I assume your local text files are actual `.txt` files?
This is way more of a response than I could have even hoped for. Thank you so much.
To answer your question, yes, my local text files are .txt files.
Ok, here's a command that works.
I'm new to actually commenting on HN as opposed to just lurking, so I hope this formatting works..
cat your_file.txt | python3 -c 'import sys, json; print(json.dumps({"input": sys.stdin.read()}))' | curl -X POST http://localhost:5123/v1/audio/speech \
-H "Content-Type: application/json" \
-d @- \
--output speech.wav
Just replace the `your_file.txt` with.. well, you get it. This'll hopefully handle any potential issues you'd have with quotes or other symbols breaking the JSON input.
Let me know how it goes!
Oh and you might want to change `python3` to `python` depending on your setup.
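And if the shell pipe ever gets annoying, something like this should work from plain Python too; it just mirrors the curl above (same port and `input` field):

  import json, sys, urllib.request

  # Read a .txt file and post it to the wrapper's OpenAI-style speech endpoint.
  text = open(sys.argv[1], encoding="utf-8").read()
  req = urllib.request.Request(
      "http://localhost:5123/v1/audio/speech",
      data=json.dumps({"input": text}).encode(),
      headers={"Content-Type": "application/json"})
  with urllib.request.urlopen(req) as resp:
      open("speech.wav", "wb").write(resp.read())

Run it as `python3 tts_file.py your_file.txt`.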
> Just replace the `your_file.txt` with.. well, you get it.
> This'll hopefully handle any potential issues you'd have with quotes or other symbols breaking the JSON input.
> Let me know how it goes!
Wow. I'm humbled and grateful.
I'll update once I'm done with work and back in front of my home machine.
Hey — just pushed a big update that adds an (opt-in) frontend to test the API
For now, there's just a textarea for input (so you'll have to copy the `.txt` contents) — but it's a lot easier than trying to finagle into a `curl` request
Let me know if you have any issues!
(Didn't carefully read your reply. What follows are the results of cat-ing a text file in the CLI. Will give the new textbox a whirl in the morning PDT. A truly heartfelt thanks for helping me work with Chatterbox TTS!)
Absolutely blown away.
I fed it the first page of Gibson's "Neuromancer" and your incantation worked like a charm. Thanks for the shell script pipe mojo.
Some other details:
- 3:01 (3 mins, 1 sec) of generated .wav took 4:28 to process
- running on M4 Max with 128GB RAM
- Chatterbox TTS inserted a few strange artifacts which sounded like air venting, machine whirring, and vehicles passing. Very odd and, oddly, apropos for cyberpunk.
- Chatterbox TTS managed to enunciate the dialog _as_ dialog, even going so far as to mimic an Australian accent where the speaker was identified as such. (This might be the effect of wishful listening.)
I am astounded. An M4 Max with 128GB RAM? *drools*
What did your `it/s` end up looking like with that setup? MLX is fascinating to me. Apple made a really smart decision with the introduction of its M-series.
With regard to the artifacts: this is definitely a known issue with Chatterbox. I'm unsure of where the current investigation on fixing it is at (or what the "tricks" are to avoid it), but it's definitely eerie, among other things.
I appreciate your feedback through all of this!
Would love to have you on the Discord to keep in touch https://chatterboxtts.com/discord