Audio is the one area small labs are winning

amplifypartners.com

168 points by rocauc 3 days ago


Taek - 24 minutes ago

My understanding is that this is purely a strategic choice by the bigger labs. When OpenAI released Whisper, it was by far best-in-class, and they haven't released any major upgrades since then. It's been 3.5 years... Whisper is older than ChatGPT.

Gemini 3 Pro Preview has superlative audio listening comprehension. If I send it a recording from a car, with me talking in English, another passenger talking to the driver in Spanish, and the radio playing in Portuguese, Gemini can parse all four audio streams plus the other background noises, give a translation for each one, figure out which voice belongs to which person, and even work out everyone's names (if the conversation makes that possible).

I'm sure it would have superlative audio generation capabilities too, if such a feature were enabled.

tl2do - 8 hours ago

This matches my experience. In Kaggle audio competitions, I've seen many competitors struggle with basics like proper PCM filtering - anti-aliasing before downsampling, handling spectral leakage, etc.
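
To make that concrete for anyone who hasn't hit it: "anti-aliasing before downsampling" just means low-pass filtering first so high frequencies don't fold back into the band. A minimal sketch (scipy assumed; sample rates and the test tone are illustrative, not from the article):

    import numpy as np
    from scipy import signal

    def downsample(x: np.ndarray, sr_in: int = 48_000, sr_out: int = 16_000) -> np.ndarray:
        # Naive decimation (x[::3]) would fold everything above 8 kHz back
        # into the audible band. resample_poly applies a low-pass FIR filter
        # before decimating, which is the "anti-aliasing" step.
        g = np.gcd(sr_in, sr_out)
        return signal.resample_poly(x, sr_out // g, sr_in // g)

    # A 10 kHz tone at 48 kHz: after proper downsampling to 16 kHz it is
    # removed, rather than aliased down to a spurious 6 kHz tone.
    t = np.arange(48_000) / 48_000
    y = downsample(np.sin(2 * np.pi * 10_000 * t))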

Audio really is a blue ocean compared to text/image ML. The barriers aren't primarily compute or data - they're knowledge. You can't scale your way out of bad preprocessing or codec choices.

When 4 researchers can build Moshi from scratch in 6 months while big labs consider voice "solved," it shows we're still in a phase where domain expertise matters more than scale. There's an enormous opportunity here for teams who understand both ML and signal processing fundamentals.

umairnadeem123 - an hour ago

I buy the thesis that audio is a wedge because latency/streaming constraints are brutal, but I wonder if it's also just that evaluation is easier. With vision, it's hard to say whether a model is 'right' without human taste, but with speech you can measure WER, speaker similarity, diarization errors, and stream jitter. Do you think the real moat is infra (real-time) or data (voices / conversational corpora)?
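
Agreed that the metrics are unusually objective. WER, for instance, is just word-level edit distance divided by the reference length; a toy version in plain Python (no ASR toolkit assumed):

    def wer(reference: str, hypothesis: str) -> float:
        """(substitutions + deletions + insertions) / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # Standard dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[-1][-1] / max(len(ref), 1)

    print(wer("turn the volume down", "turn volume town"))  # 0.5: one deletion, one substitution, four reference words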

nowittyusername - 6 hours ago

Good article, and I agree with everything in it. For my own voice agent I decided to make it push-to-talk by default, because the problems with the model accurately guessing the end of an utterance are just too great. I think it can be solved eventually, but I haven't seen a really good example of it being done with modern-day tech, including from these labs. Fundamentally it comes down to the fact that different humans speak differently, and a human listener updates their own internal model of the other person's speech pattern, adjusting it after a couple of interactions and settling on the right way to converse with that person. Something very similar will need to happen, at very low latency, for this to succeed in audio ML, and I don't think we have anything like that yet. Currently the best you can do is tune the model on a generic speech pattern that you expect to fit a large share of the population; anyone who falls outside of it will feel the pain of getting interrupted every time.
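
For reference, the endpointing most simple agents do is roughly this: an energy threshold plus a trailing-silence timeout. A sketch with made-up thresholds, which is exactly where the one-size-fits-all problem comes from:

    import numpy as np

    def end_of_utterance(frames, energy_thresh=1e-3, max_quiet_frames=25):
        # frames: iterable of 20 ms numpy chunks (e.g. 320 samples at 16 kHz).
        # Declares end-of-utterance after ~0.5 s of consecutive low-energy
        # frames once speech has been heard. Fixed thresholds fit "average"
        # speakers; slow or soft talkers get cut off mid-sentence.
        seen_speech, quiet = False, 0
        for i, frame in enumerate(frames):
            energy = float(np.mean(frame.astype(np.float64) ** 2))
            if energy > energy_thresh:
                seen_speech, quiet = True, 0
            elif seen_speech:
                quiet += 1
                if quiet >= max_quiet_frames:
                    return i  # frame index where the agent decides to respond
        return None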

hadlock - 2 hours ago

Can someone recommend a service that will generate a loopable engine drone for a "WWII Plane Japan Kawasaki Ki-61"? It doesn't have to be perfect, just convincing in a Hollywood-blockbuster context, and not just a warmed-over clone of a Merlin engine sound. It turns out Suno will make whatever background music I need, but I want a "unique sound effect on demand" service. I'm not convinced the voice AI stuff is sustainable.

dkarp - 8 hours ago

There's too much noise at large organizations

giancarlostoro - 8 hours ago

OpenAI being the Death Star and audio AI being the rebels is such a weird comparison. Like, what? Wouldn't the real rebels be the ones running their own models locally?

mrbluecoat - 3 hours ago

The bigger players probably avoid it because it's a bigger legal liability: https://news.ycombinator.com/item?id=47025864

...plenty of money to be made elsewhere

bossyTeacher - 8 hours ago

Surprised ElevenLabs is not mentioned

amelius - 9 hours ago

Probably because the big companies have their focus elsewhere.

SilverElfin - 5 hours ago

Does Wisprflow count as an audio “lab”?

RobMurray - 4 hours ago

For a laugh, enter nonsense at https://gradium.ai/

You get all kinds of weird noises and random words. Jack is often apologetic about the problem you are having with the Hyperion xt5000 smart hub.

lysace - 5 hours ago

Also: porn.

Audio is too niche and porn is too ethically messy and legally risky.

There's also music, which the giants don't touch either. Suno is actually really impressive.
