Pure C, CPU-only inference with Mistral Voxtral Realtime 4B speech to text model

267 points by Curiositry 16 hours ago

I use the open source Handy [1] app with Parakeet V3 for STT when talking to coding agents and I’ve yet to see anything that beats this setup in terms of speed/accuracy. I get near instant transcription, and the slight accuracy drop is immaterial when talking to AIs that can “read between the lines”.

I tried incorporating this Voxtral C implementation into Handy but got very slow transcriptions on my M1 Max MacBook 64GB.

[1] https://github.com/cjpais/Handy

I’ll have to try the other implementations mentioned here.

mythz - 8 hours ago

Big fan of Salvatore's voxtral.c and flux2.c projects - hope they continue to get optimized as it'd be great to have lean options without external deps. Unfortunately, when adding Voice Input support to llms-py [1], I found it's currently too slow for real-world use (AMD 7800X3D/BLAS).

In the end Omarchy's new support for voxtype.io provided the nicest UX, followed by Whisper.cpp, and despite being slower, OpenAI's Whisper is still a solid local transcription option.

Also very impressed with both the performance and price of Mistral's new Voxtral Transcription API [2] - really fast/instant and really cheap ($0.003/min), IMO the best option in CPU/disk-constrained environments.

[1] https://llmspy.org/docs/features/voice-input

[2] https://docs.mistral.ai/models/voxtral-mini-transcribe-26-02

antirez - 5 hours ago

Hi! This model is great, but it is too big for local inference. Whisper medium (the "base" IMHO is not usable for most things, and "large" is too large) is a better deal for many environments, even if the transcription quality is noticeably lower (and even if it does not have a real online mode). But... it's time for me to check the new Qwen 0.6 transcription model. If it works as well as their benchmarks claim, that could be the target for very serious optimizations and a no-deps inference chain conceived from the start for CPU execution, not just for MPS, since many times you want to install such transcription systems on servers rented online via Hetzner and other similar vendors. So I'm going to handle it next, and if it really delivers, it will be time for big optimizations specifically covering the Intel, AMD and ARM instruction sets, potentially also thinking about 8-bit quants if the performance remains good.
- dust42 - 5 hours ago
  
  Same experience here with Whisper, medium is often not good enough. The large-turbo model however is pretty decent and on Apple silicon fast enough for real time conversations. The addition of the prompt parameter can also help with transcription quality, especially when using domain-specific vocabulary. In general Whisper.cpp is better at transcribing full phrases than at streaming.
  And not to forget, for many use cases more than just English is needed. Unfortunately right now most STT/ASR and TTS focus on English plus 0-10 other languages. Thus being able to add more languages or domain-specific vocabulary with reasonable effort would be a huge plus for any STT and TTS.
grigio - an hour ago

+1 for voxtype with the Whisper base model; it is quite fast and accurate
mijoharas - 8 hours ago

One thing I keep looking for is transcribing while I'm talking. I feel like I need that visual feedback. Does voxtype support that?
(I wasn't able to find anything at a glance)
Handy claims to have an overlay, but it seems to not work on my system.
- mythz - 8 hours ago
  
  Not sure how it works on other OSes, but in Omarchy [1] you hold down `Super + Ctrl + X` to start recording and release it to stop. While it's recording you'll see a red voice recording icon in the top bar, so it's clear when it's recording.
  Although as llms-py is a local web app, I had to build my own visual indicator [2], which also displays a red microphone next to the prompt when it's recording. It also supports both Tap On/Off and hold-down recording modes. When using voxtype I'm just using the tool for transcription (i.e. not the Omarchy OS-wide dictation feature) like:
  $ voxtype transcribe /path/to/audio.wav
  If you're interested the Python source code to support multiple voice transcription backends is at: [3]
  [1] https://learn.omacom.io/2/the-omarchy-manual/107/ai
  [2] https://llmspy.org/docs/features/voice-input
  [3] https://github.com/ServiceStack/llms/blob/main/llms/extensio...
  - mijoharas - 4 hours ago
    
    Ah, the thing I really want is to see the words that I'm speaking being transcribed (i.e. realtime). For some reason I rarely see that feature.
    
    bmn__ - 3 hours ago
    
    The more things change…
    https://news.ycombinator.com/item?id=21711755
    
    mijoharas - 3 hours ago
    
    hahaha! plus ça change indeed.
    (I keep coming back to this one so I've got half a dozen messages on HN asking for the exact same thing!).
    It's a shame: Whisper is so prevalent and everyone uses it, but it's not great at actual streaming.
    I'm hoping one of these might become a realtime de facto standard so we can actually get our realtime streaming API (and yep, I'd be perfectly happy with something just writing to stdout, but all the tools always end up just batching it because it's simpler!).
- Doman - 5 hours ago
  
  I am using a window manager with Waybar. Voxtype can display a status icon on Waybar [1], which is enough for me to know what is going on.
  [1] https://github.com/peteonrails/voxtype/blob/main/docs/WAYBAR...

Curiositry - 14 hours ago

This was a breeze to install on Linux. However, I haven't managed to get realtime transcription working yet, à la Whisper.cpp stream or Moonshine.

--from-mic only supports Mac. I'm able to capture audio with ffmpeg, but adapting the ffmpeg example to use mic capture hasn't worked yet:

ffmpeg -f pulse -channels 1 -i 1 -f s16le - 2>/dev/null | ./voxtral -d voxtral-model --stdin

It's possible my system is simply under spec for the default model.

I'd like to be able to use this with the voxtral-q4.gguf quantized model from here: https://huggingface.co/TrevorJS/voxtral-mini-realtime-gguf
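
A minimal sketch of the mic->ffmpeg->voxtral pipeline discussed here, written in Python so the two command lines can be inspected before anything is launched. The `-d voxtral-model --stdin` flags are the ones quoted in the command above; the `default` PulseAudio source and the 16 kHz mono `s16le` format are assumptions that should be checked against the project's README, and the function names are hypothetical.

```python
import subprocess


def build_capture_cmd(source="default", rate=16000):
    """ffmpeg command that reads a PulseAudio source and writes raw PCM to stdout.

    Assumption: the transcriber wants mono signed 16-bit little-endian samples
    at `rate` Hz; adjust if the README says otherwise.
    """
    return [
        "ffmpeg", "-loglevel", "error",
        "-f", "pulse", "-i", source,   # PulseAudio input device
        "-ac", "1", "-ar", str(rate),  # downmix to mono, resample
        "-f", "s16le", "-",            # raw PCM on stdout
    ]


def build_voxtral_cmd(model_dir="voxtral-model", binary="./voxtral"):
    """voxtral command reading audio from stdin (flags as quoted in the thread)."""
    return [binary, "-d", model_dir, "--stdin"]


def run_pipeline(source="default"):
    """Start both processes with ffmpeg's stdout wired to voxtral's stdin."""
    capture = subprocess.Popen(build_capture_cmd(source), stdout=subprocess.PIPE)
    transcriber = subprocess.Popen(build_voxtral_cmd(), stdin=capture.stdout)
    capture.stdout.close()  # let ffmpeg see a broken pipe if voxtral exits
    return capture, transcriber
```

Calling `run_pipeline()` and then waiting on the returned transcriber process would mirror the shell pipeline. Printing `build_capture_cmd()` first makes it easy to compare the explicit sample rate and channel count against what the model expects.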

jwrallie - 12 hours ago

I am interested in a way to capture audio not only from the mic, but also from one of the monitor ports so you could pipe the audio you are hearing from the web directly for real-time transcription with one of these solutions. Did anyone manage to do that?
I can, for example, capture audio from that with Audacity or OBS Studio and do it later, so it should be possible to do it in real time too assuming my machine can keep up.
- bebna - 10 hours ago
  
  Set `-i 1` to `-i default` or to one of your monitor sources; look them up with `pactl list short sources`.
  https://trac.ffmpeg.org/wiki/Capture/PulseAudio
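
As a rough illustration of the lookup suggested above, the sketch below filters the text printed by `pactl list short sources` for monitor sources. It assumes the usual tab-separated layout (index, name, driver, sample spec, state) and that monitor source names end in `.monitor`; the sample text is made up for the example, and the helper name is hypothetical.

```python
def monitor_sources(pactl_output):
    """Return names of monitor sources from `pactl list short sources` output.

    Each line is expected to be tab-separated: index, name, driver,
    sample spec, state. Lines that do not fit are skipped.
    """
    names = []
    for line in pactl_output.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1].endswith(".monitor"):
            names.append(fields[1])
    return names


# Illustrative sample, not captured from a real machine.
SAMPLE = (
    "0\talsa_output.pci-0000_00_1f.3.analog-stereo.monitor"
    "\tmodule-alsa-card.c\ts16le 2ch 48000Hz\tSUSPENDED\n"
    "1\talsa_input.pci-0000_00_1f.3.analog-stereo"
    "\tmodule-alsa-card.c\ts16le 2ch 48000Hz\tRUNNING\n"
)

print(monitor_sources(SAMPLE))
```

A name returned by this helper could then be passed to ffmpeg as the `-i` value in place of `1`.
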
yjftsjthsd-h - 13 hours ago

Does it work if you use ffmpeg to feed it audio from a file? I personally would try file->ffmpeg->voxtral then mic->ffmpeg->file, and then try to glue together mic->ffmpeg->voxtral.
(But take with a grain of salt; I haven't tried yet)
- Curiositry - 10 hours ago
  
  Recording audio with ffmpeg and transcribing a file that’s piped from ffmpeg both work.
  Given that it took 19.64 mins to transcribe the 11 second sample wav, it’s possible I just didn’t wait long enough :)
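
The timing above can be expressed as a real-time factor (processing time divided by audio duration). This is only a worked calculation on the two numbers quoted in the comment; the function name is hypothetical.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration.

    Values above 1 mean the transcriber cannot keep up with live audio.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# 19.64 minutes of processing for an 11 second clip.
rtf = real_time_factor(19.64 * 60, 11)
print(f"real-time factor: {rtf:.1f}")  # about 107
```

At roughly 107 times slower than real time, a live mic stream would fall behind almost immediately, which is consistent with seeing no output in a short test.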
  - yjftsjthsd-h - 9 hours ago
    
    Ah. In that case... Yeah. Is it using GPU, and does the whole model fit in your (V)RAM?
    
    ekianjo - 8 hours ago
    
    This is a CPU implementation only.
    
    yjftsjthsd-h - 2 hours ago
    
    Oh, that's interesting. The readme talks about GPU acceleration on Apple Silicon and I didn't see anything explicit for other platforms, so I assumed it needs GPU everywhere, but it does BLAS acceleration, which a web search seems to agree is just a CPU-optimized math library. That's great; should really increase the places where it's useful :)

written-beyond - 9 hours ago

Funny, this and the Rust runtime implementation are neck and neck on the frontpage right now.

Cool project!

hrpnk - 8 hours ago

There is also an MLX implementation: https://github.com/awni/voxmlx

sgt - 10 hours ago

I'm very interested in speech to text, particularly for tricky dialects and specialized terminology, but I'm still confused about the best place to start in order to train the models with a huge database of voice samples I own.

Any ideas from the HN crowd currently involved in speech 2 text models?

9999_points - an hour ago

It seems so bizarre that we need a nearly 9 GB model to do something you could do over 20 years ago with ~200 MB.

alextray812 - 5 hours ago

From a cybersecurity perspective, this project is impressive not just for performance, but for transparency.

sylware - 6 hours ago

Finally a plain and simple C lib to run open-weights LLMs?
