Audio is the one area small labs are winning

amplifypartners.com

168 points by rocauc 3 days ago


Taek - 24 minutes ago

My understanding is that this is purely a strategic choice by the bigger labs. When OpenAI released Whisper, it was by far best-in-class, and they haven't released any major upgrades since then. It's been 3.5 years... Whisper is older than ChatGPT.

Gemini 3 Pro Preview has superlative audio listening comprehension. If I send it a recording from a car, with me talking in English, another passenger talking to the driver in Spanish, and the radio playing in Portuguese, Gemini can parse all four audio streams plus the other background noises, give a translation for each one, figure out which voice belongs to which person, and even work out everyone's names (if the conversation makes that possible).

I'm sure it would have superlative audio generation capabilities too, if such a feature were enabled.

tl2do - 8 hours ago

This matches my experience. In Kaggle audio competitions, I've seen many competitors struggle with basics like proper PCM filtering - anti-aliasing before downsampling, handling spectral leakage, etc.
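
To make that concrete for anyone who hasn't hit it: "anti-aliasing before downsampling" just means low-pass filtering first so high frequencies don't fold back into the band. A minimal sketch (scipy assumed; sample rates and the test tone are illustrative, not from the article):

    import numpy as np
    from scipy import signal

    def downsample(x: np.ndarray, sr_in: int = 48_000, sr_out: int = 16_000) -> np.ndarray:
        # Naive decimation (x[::3]) would fold everything above 8 kHz back
        # into the audible band. resample_poly applies a low-pass FIR filter
        # before decimating, which is the "anti-aliasing" step.
        g = np.gcd(sr_in, sr_out)
        return signal.resample_poly(x, sr_out // g, sr_in // g)

    # A 10 kHz tone at 48 kHz: after proper downsampling to 16 kHz it is
    # removed, rather than aliased down to a spurious 6 kHz tone.
    t = np.arange(48_000) / 48_000
    y = downsample(np.sin(2 * np.pi * 10_000 * t))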

Audio really is a blue ocean compared to text/image ML. The barriers aren't primarily compute or data - they're knowledge. You can't scale your way out of bad preprocessing or codec choices.

When 4 researchers can build Moshi from scratch in 6 months while big labs consider voice "solved," it shows we're still in a phase where domain expertise matters more than scale. There's an enormous opportunity here for teams who understand both ML and signal processing fundamentals.

umairnadeem123 - an hour ago

I buy the thesis that audio is a wedge because latency/streaming constraints are brutal, but I wonder if it's also just that evaluation is easier. With vision, it's hard to say whether a model is 'right' without human taste, but with speech you can measure WER, speaker similarity, diarization errors, and stream jitter. Do you think the real moat is infra (real-time) or data (voices / conversational corpora)?
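
Agreed that the metrics are unusually objective. WER, for instance, is just word-level edit distance divided by the reference length; a toy version in plain Python (no ASR toolkit assumed):

    def wer(reference: str, hypothesis: str) -> float:
        """(substitutions + deletions + insertions) / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # Standard dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[-1][-1] / max(len(ref), 1)

    print(wer("turn the volume down", "turn volume town"))  # 0.5: one deletion, one substitution, four reference words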

nowittyusername - 6 hours ago

Good article, and I agree with everything in it. For my own voice agent I decided to make it push-to-talk by default, because the problems with the model accurately guessing the end of an utterance are just too great. I think it can be solved eventually, but I haven't seen a really good example of it being done with modern-day tech, including from these labs. Fundamentally it comes down to the fact that different humans speak differently, and a human listener updates their own internal model of the other person's speech pattern, adjusting it after a couple of interactions and settling on the right way to converse with that person. Something very similar will need to happen, at very low latency, for this to succeed in audio ML, and I don't think we have anything like that yet. Currently the best you can do is tune the model on a generic speech pattern that you expect to fit a large share of the population; anyone who falls outside of it will feel the pain of getting interrupted every time.
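
For reference, the endpointing most simple agents do is roughly this: an energy threshold plus a trailing-silence timeout. A sketch with made-up thresholds, which is exactly where the one-size-fits-all problem comes from:

    import numpy as np

    def end_of_utterance(frames, energy_thresh=1e-3, max_quiet_frames=25):
        # frames: iterable of 20 ms numpy chunks (e.g. 320 samples at 16 kHz).
        # Declares end-of-utterance after ~0.5 s of consecutive low-energy
        # frames once speech has been heard. Fixed thresholds fit "average"
        # speakers; slow or soft talkers get cut off mid-sentence.
        seen_speech, quiet = False, 0
        for i, frame in enumerate(frames):
            energy = float(np.mean(frame.astype(np.float64) ** 2))
            if energy > energy_thresh:
                seen_speech, quiet = True, 0
            elif seen_speech:
                quiet += 1
                if quiet >= max_quiet_frames:
                    return i  # frame index where the agent decides to respond
        return None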

hadlock - 2 hours ago

Can someone recommend a service that will generate a loopable engine drone for a "WWII Plane Japan Kawasaki Ki-61"? It doesn't have to be perfect, just convincing in a Hollywood-blockbuster context, and not just a warmed-over clone of a Merlin engine sound. It turns out Suno will make whatever background music I need, but I want a "unique sound effect on demand" service. I'm not convinced the voice AI stuff is sustainable.

dkarp - 8 hours ago

There's too much noise at large organizations

giancarlostoro - 8 hours ago

OpenAI being the Death Star and audio AI being the rebels is such a weird comparison. Like, what? Wouldn't the real rebels be the ones running their own models locally?

mrbluecoat - 3 hours ago

The bigger players probably avoid it because it's a bigger legal liability: https://news.ycombinator.com/item?id=47025864

...plenty of money to be made elsewhere

bossyTeacher - 8 hours ago

Surprised ElevenLabs is not mentioned

amelius - 9 hours ago

Probably because the big companies have their focus elsewhere.

SilverElfin - 5 hours ago

Does Wisprflow count as an audio “lab”?

RobMurray - 4 hours ago

For a laugh, enter nonsense at https://gradium.ai/

You get all kinds of weird noises and random words. Jack is often apologetic about the problem you are having with the Hyperion xt5000 smart hub.

lysace - 5 hours ago

Also: porn.

Audio is too niche and porn is too ethically messy and legally risky.

There's also music, which the giants don't touch either. Suno is actually really impressive.
