Rust implementation of Mistral's Voxtral Mini 4B Realtime runs in your browser

github.com

356 points by Curiositry 16 hours ago


HorizonXP - 13 hours ago

If folks are interested, @antirez has open-sourced a C implementation of Voxtral Mini 4B here: https://github.com/antirez/voxtral.c

I have my own fork here: https://github.com/HorizonXP/voxtral.c where I’m working on a CUDA implementation, plus some other niceties. It’s working quite well so far, but I haven’t got it to match Mistral AI’s API endpoint speed just yet.

mentalgear - 11 hours ago

Kudos, this is where it's at: open models running on-premise, preferred by users and businesses alike. Glad Mistral's got that figured out.

simonw - 11 hours ago

I tried the demo and it looks like you have to click Mic, then record your audio, then click "Stop and transcribe" in order to see the result.

Is it possible to rig this up so it really is realtime, displaying the transcription within a second or two of the user saying something out loud?

The Hugging Face server-side demo at https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim... manages that, but it's using a much larger (~8.5GB) server-side model running on GPUs.
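Roughly what I have in mind, as a sketch rather than anything from this repo (Model and transcribe_chunk are hypothetical stand-ins): capture the mic as fixed windows of samples and emit partial text per window instead of one transcribe-on-stop call.

    // Hypothetical sketch: feed ~2 s windows of 16 kHz mono samples to
    // the decoder as they arrive, instead of transcribing once on stop.
    const SAMPLE_RATE: usize = 16_000;
    const WINDOW_SECS: usize = 2;

    struct Model;
    impl Model {
        // Stand-in for the real decoder: partial text for one window.
        fn transcribe_chunk(&mut self, _pcm: &[f32]) -> String {
            String::new()
        }
    }

    fn run_streaming(model: &mut Model, mic: impl Iterator<Item = f32>) {
        let mut window = Vec::with_capacity(SAMPLE_RATE * WINDOW_SECS);
        for sample in mic {
            window.push(sample);
            if window.len() == SAMPLE_RATE * WINDOW_SECS {
                // Surface a partial transcript every couple of seconds.
                println!("{}", model.transcribe_chunk(&window));
                window.clear();
            }
        }
    }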

radarsat1 - 3 hours ago

It's cool, but do I really want a single browser tab downloading 2.5 GB of data, only for it to be thrown away as ephemeral storage? I know the internet is fast now and disk space is cheap, but I have trouble bringing myself around to this way of doing things. It feels so inefficient. I do like the idea of client-side compute, but I feel like a model (or anything) this big belongs on the server.
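If nothing else I'd want the weights to persist client-side so a second visit doesn't re-download them. A sketch of that, assuming web-sys and wasm-bindgen-futures (the cache name and function are mine, not the project's):

    // Sketch, not the project's code: fetch the weights through the
    // browser Cache API so a second visit reads them from disk instead
    // of the network. Assumes wasm-bindgen, wasm-bindgen-futures, and
    // web-sys with the Window/CacheStorage/Cache/Response features.
    use wasm_bindgen::{JsCast, JsValue};
    use wasm_bindgen_futures::JsFuture;
    use web_sys::{Cache, Response};

    async fn fetch_weights_cached(url: &str) -> Result<Response, JsValue> {
        let window = web_sys::window().expect("no window");
        let cache: Cache = JsFuture::from(window.caches()?.open("voxtral-weights"))
            .await?
            .dyn_into()?;

        // Serve from the cache when the weights are already stored.
        let hit = JsFuture::from(cache.match_with_str(url)).await?;
        if !hit.is_undefined() {
            return hit.dyn_into();
        }

        // First visit: download once, store a clone for next time.
        let resp: Response = JsFuture::from(window.fetch_with_str(url))
            .await?
            .dyn_into()?;
        JsFuture::from(cache.put_with_str(url, &resp.clone()?)).await?;
        Ok(resp)
    }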

boutell - 2 hours ago

This stuff is cool. So is Whisper. But I keep hoping for something that can run close to real time on a Raspberry Pi Zero 2 with a reasonable English vocabulary.

Right now everything is either archaic or requires too much RAM. CPU isn't as big an issue as you'd think, because the Pi Zero 2 is comparable to a Pi 3.

Jayakumark - 14 hours ago

Awesome work. It would be good to have it work with handy.computer. Also, are there plans to support streaming?

zaptheimpaler - 10 hours ago

I don't know anything about these models, but I've been trying Nvidia's Parakeet and it works great. For a model like this, where the full model is 9 GB, do you have to keep it loaded in GPU memory at all times for it to really work in realtime? Or what's the delay like to load all the weights each time you want to use it?
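I'd assume the answer is the usual load-once-keep-resident pattern, something like this sketch (Model is a hypothetical stand-in, not this project's API):

    // Sketch of the load-once-keep-resident pattern: the first request
    // pays the weight-loading cost, every later one reuses the same
    // device-resident model.
    use std::sync::OnceLock;

    struct Model;

    impl Model {
        fn load() -> Model {
            // ~9 GB of weights: seconds from page cache or NVMe, much
            // longer from cold storage, so you want to do this once.
            Model
        }
        fn transcribe(&self, _pcm: &[f32]) -> String {
            String::new()
        }
    }

    static MODEL: OnceLock<Model> = OnceLock::new();

    fn transcribe(pcm: &[f32]) -> String {
        // First caller triggers Model::load; later callers hit the
        // already-resident weights with no load delay.
        MODEL.get_or_init(Model::load).transcribe(pcm)
    }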

mikebelanger - 5 hours ago

Neat, and neat to see the burn framework getting used. I tried this on the latest Chromium, but my system froze until my OS killed Chromium. My VPN connection died right after downloading the model, too (it doesn't have a bandwidth cap, so I'm not sure what happened).

arkensaw - 8 hours ago

Look, I think it's great that it runs in the browser and all, but I don't want to live in a world where it's normal for a website to download 2.5 GB in the background to run something.

scronkfinkle - 4 hours ago

Nice!

I'm interested in your cubecl-wgpu patches. I've been struggling to get lower-than-FP32 safetensors models working on burn. Did you write the patches to cubecl-wgpu to get around this restriction, to add support for GGUF files, or both?

I've been working on something similar, but for whisper and as a library for other projects: https://github.com/Scronkfinkle/quiet-crab
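For context, the piece I keep having to hand-roll is the dequantization step, something like this sketch of the standard GGUF Q8_0 layout (blocks of 32 i8 values behind an f16 scale), expanded to f32 before burn sees it. This is my guess at the shape of the problem, not your patches:

    // Sketch: expand GGUF Q8_0 blocks to f32. Each block is a
    // little-endian f16 scale followed by 32 quantized i8 values;
    // dequantized value = scale * q. Uses the `half` crate for f16.
    use half::f16;

    const QK8_0: usize = 32;               // values per Q8_0 block
    const BLOCK_BYTES: usize = 2 + QK8_0;  // f16 scale + 32 bytes

    fn dequantize_q8_0(raw: &[u8]) -> Vec<f32> {
        let mut out = Vec::with_capacity(raw.len() / BLOCK_BYTES * QK8_0);
        for block in raw.chunks_exact(BLOCK_BYTES) {
            let scale = f16::from_le_bytes([block[0], block[1]]).to_f32();
            for &q in &block[2..] {
                out.push(scale * (q as i8) as f32);
            }
        }
        out
    }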

Retr0id - 14 hours ago

hm, seems broken on my machine (Firefox, Asahi Linux, M1 Pro). I said hello into the mic, and it churned for a minute or so before giving me:

panorama panorama panorama panorama panorama panorama panorama panorama� molest rist moundothe exh� Invothe molest Yan artist��������� Yan Yan Yan Yan Yanothe Yan Yan Yan Yan Yan Yan Yan

fusslo - 4 hours ago

I wonder if there's a metric or measure of how much jargon goes into a README or other document.

Reading the first three sentences of this README: of the 43 words, I would consider 15 terms to be jargon incomprehensible to the layman.
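A crude version of that metric is easy to sketch; the jargon list below is made up for illustration, not taken from the actual README:

    // Toy jargon-density metric: (words found in a jargon list) /
    // (total words), after stripping punctuation and lowercasing.
    use std::collections::HashSet;

    fn jargon_density(text: &str, jargon: &HashSet<&str>) -> f32 {
        let words: Vec<String> = text
            .split_whitespace()
            .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
            .collect();
        if words.is_empty() {
            return 0.0;
        }
        let hits = words.iter().filter(|w| jargon.contains(w.as_str())).count();
        hits as f32 / words.len() as f32
    }

    fn main() {
        let jargon: HashSet<&str> = ["wasm", "quantized", "safetensors", "wgpu"].into();
        println!("{:.0}%", 100.0 * jargon_density("runs quantized wasm in wgpu", &jargon));
    }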

explosion-s - 4 hours ago

Just curious: is there any smaller version of this model capable of running on edge devices? Even my M1 Mac with 8 GB of RAM couldn't run the C version.

ubixar - 7 hours ago

For those exploring browser STT, this sits in an interesting space between Whisper.wasm and the Deepgram KC client. The 2.5 GB quantized footprint is notably smaller than most Whisper variants; any thoughts on accuracy tradeoffs compared to Whisper base/small?

another_twist - 3 hours ago

Uggh, I had just started working on this. Congratulations to the author!

misiek08 - 9 hours ago

(no speech detected)

Or, when I'm not saying anything at all, it generates random German sentences.

jszymborski - 14 hours ago

Man, I'd love to fine-tune this, but alas, the Hugging Face implementation isn't out as far as I can tell.

Nathanba - 13 hours ago

I just tried it. I said "what's up buddy, hey hey stop" and it transcribed this for me: "وطبعا هاي هاي هاي ستوب" (Arabic script, roughly a transliteration of "and of course hey hey hey stop"). No, I'm not in any Arabic or Middle Eastern country. The second test was better; it detected English.

TZubiri - 5 hours ago

Impressive, but to state the obvious, this is not yet practical for browser use due to its (at least) 2.5 GB memory footprint.

refulgentis - 13 hours ago

Notably, this isn't even close to realtime, and that's on an M4 Max.

sergiotapia - 15 hours ago

>init failed: Worker error: Uncaught RuntimeError: unreachable

Anything I can do to fix/try it on Brave?