Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon

github.com

211 points by sanchitmonga22 16 hours ago


Hi HN, we're Sanchit and Shubham (YC W26). We built MetalRT, a fast inference engine for Apple Silicon. Across LLMs, speech-to-text, and text-to-speech, MetalRT beats llama.cpp, Apple's MLX, Ollama, and sherpa-onnx on every modality we tested. Custom Metal shaders, no framework overhead.

Also, we've open-sourced RCLI, the fastest end-to-end voice AI pipeline on Apple Silicon. Mic to spoken response, entirely on-device. No cloud, no API keys.

To get started:

  brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
  brew install rcli
  rcli setup   # downloads ~1 GB of models
  rcli         # interactive mode with push-to-talk

Or:

  curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

The numbers (M4 Max, 64 GB, reproducible via `rcli bench`):

LLM decode – 1.67x faster than llama.cpp, 1.19x faster than Apple MLX (same model files):

- Qwen3-0.6B: 658 tok/s (vs mlx-lm 552, llama.cpp 295)
- Qwen3-4B: 186 tok/s (vs mlx-lm 170, llama.cpp 87)
- LFM2.5-1.2B: 570 tok/s (vs mlx-lm 509, llama.cpp 372)
- Time-to-first-token: 6.6 ms

STT – 70 seconds of audio transcribed in *101 ms*. That's 714x real-time. 4.6x faster than mlx-whisper.

TTS – 178 ms synthesis. 2.8x faster than mlx-audio and sherpa-onnx.

We built this because demoing on-device AI is easy but shipping it is brutal. Voice is the hardest test: you're chaining STT, LLM, and TTS sequentially, and if any stage is slow, the user feels it. Most teams fall back to cloud APIs not because local models are bad, but because local inference infrastructure is.

The thing that's hard to solve is latency compounding. In a voice pipeline, you're stacking three models in sequence. If each adds 200ms, you're at 600ms before the user hears a word, and that feels broken. You can't optimize one stage and call it done. Every stage needs to be fast, on one device, with no network round-trip to hide behind.
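
The arithmetic behind that can be sketched in a few lines. This is only an illustration: the 200 ms figures are the example numbers from the paragraph above, the 50 ms figure and the function name are hypothetical, and none of it is a measurement of RCLI.

```python
def time_to_first_audio_ms(stage_latencies_ms):
    """Stages run one after another, so their delays add up."""
    return sum(stage_latencies_ms.values())

# Illustrative figures from the paragraph above, not measurements.
baseline = {"stt": 200, "llm": 200, "tts": 200}
baseline_total = time_to_first_audio_ms(baseline)

# Hypothetical: speed up only the LLM stage to 50 ms.
one_stage_tuned = dict(baseline, llm=50)
tuned_total = time_to_first_audio_ms(one_stage_tuned)

print(baseline_total)  # 600
print(tuned_total)     # 450
```

Even with one stage made four times faster, the other two still dominate, which is why every stage has to be fast.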

We went straight to Metal. Custom GPU compute shaders, all memory pre-allocated at init (zero allocations during inference), and one unified engine for all three modalities instead of stitching separate runtimes together.
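
As a rough sketch of the "pre-allocate at init, allocate nothing during inference" pattern: the class below is hypothetical and is not MetalRT's code, which is not shown in this post. It reserves one buffer up front and hands out regions of it. Python still creates small view objects on each call, so this only illustrates the shape of the idea; a real engine would do the same thing with GPU buffers.

```python
class ScratchArena:
    """Hypothetical bump allocator over a single buffer reserved at init."""

    def __init__(self, capacity_bytes):
        self._buf = bytearray(capacity_bytes)  # the only large allocation
        self._view = memoryview(self._buf)
        self._offset = 0

    def take(self, nbytes):
        """Hand out the next nbytes of the reserved buffer."""
        if self._offset + nbytes > len(self._buf):
            raise MemoryError("arena exhausted; size it for the worst case at init")
        region = self._view[self._offset:self._offset + nbytes]
        self._offset += nbytes
        return region

    def reset(self):
        """Reuse the same memory for the next inference step."""
        self._offset = 0


arena = ScratchArena(1024)
first = arena.take(256)
first[:4] = b"abcd"
arena.reset()
second = arena.take(256)   # same backing memory as `first`
print(bytes(second[:4]))   # b'abcd'
```

Sizing the arena for the worst case at startup turns out-of-memory into an init-time failure instead of a mid-inference stall.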

MetalRT is the first engine to handle all three modalities natively on Apple Silicon. Full methodology:

LLM benchmarks: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e...

Speech benchmarks: https://www.runanywhere.ai/blog/metalrt-speech-fastest-stt-t...

How: Most inference engines add layers between you and the GPU: graph schedulers, runtime dispatchers, memory managers. MetalRT skips all of it. Custom Metal compute shaders for quantized matmul, attention, and activation - compiled ahead of time, dispatched directly.

Voice pipeline optimization details: https://www.runanywhere.ai/blog/fastvoice-on-device-voice-ai...

RAG optimizations: https://www.runanywhere.ai/blog/fastvoice-rag-on-device-retr...

RCLI is the open-source voice pipeline (MIT) built on MetalRT: three concurrent threads with lock-free ring buffers, double-buffered TTS, 38 macOS actions by voice, local RAG (~4 ms over 5K+ chunks), 20 hot-swappable models, and a full-screen TUI with per-op latency readouts. Falls back to llama.cpp when MetalRT isn't installed.
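
For readers unfamiliar with ring buffers between pipeline threads, here is a minimal sketch of the single-producer/single-consumer index logic. It is an assumption-laden illustration, not RCLI's implementation: the class and method names are made up, and Python's interpreter lock means this shows only the bookkeeping. A genuinely lock-free version would use atomic loads and stores on the two indices in a language such as C++.

```python
class SpscRing:
    """Hypothetical single-producer/single-consumer ring buffer (index logic only)."""

    def __init__(self, capacity):
        # One slot stays empty so "full" and "empty" are distinguishable.
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read; advanced only by the consumer
        self._tail = 0  # next slot to write; advanced only by the producer

    def try_push(self, item):
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False  # full: the producer must retry later
        self._slots[self._tail] = item
        self._tail = nxt
        return True

    def try_pop(self):
        if self._head == self._tail:
            return False, None  # empty
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return True, item


ring = SpscRing(capacity=2)
print(ring.try_push("chunk-0"))  # True
print(ring.try_push("chunk-1"))  # True
print(ring.try_push("chunk-2"))  # False
print(ring.try_pop())            # (True, 'chunk-0')
```

In a three-thread STT, LLM, TTS pipeline, one such ring would sit between each pair of adjacent stages, so each ring has exactly one writer and one reader.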

Source: https://github.com/RunanywhereAI/RCLI (MIT)

Demo: https://www.youtube.com/watch?v=eTYwkgNoaKg

What would you build if on-device AI were genuinely as fast as cloud?

stingraycharles - 16 hours ago

I’m a bit confused by what you’re offering. Is it a voice assistant / AI as described on your GitHub? Or is it more general-purpose / LLM?

How does the RAG fit in? Voice-to-RAG seems a bit random as a feature.

I don’t mean to come across as dismissive, I’m genuinely confused as to what you’re offering.

jonplackett - an hour ago

Really thought this was called Meta IRT and assumed it was just Facebook spyware.

vessenes - 16 hours ago

Just tried it. Really cool, and a fun tech demo with rcli. I filed a bug report; not everything loads properly when installed via Homebrew.

Quick request: Unsloth quants; bit for bit, they're usually better. Or, more generally, a UI for Hugging Face model selection. I understand you won't be able to serve everything, but I want to mix and match!

Also - grounding:

"open safari" (safari opens, voice says: "I opened safari")

"navigate to google.com in safari" (nothing happens, voice says: "I navigated to google.com")

Anyway, really fun.

brainless - 40 minutes ago

I am interested in MetalRT. I am an indie builder, focused mostly on building products with LLM assistance that run locally. Like: https://github.com/brainless/dwata

I would be interested to know whether MetalRT can be used by other products, and whether you have any plans for open-source products.

jonhohle - 15 hours ago

If I send a Portfile patch, would you consider MacPorts distribution?

Reebz - 6 hours ago

Do you have plans to port your proprietary library MetalRT to mobile devices? These performance gains would be a boon for privacy-centric mobile applications.

mips_avatar - 12 hours ago

Have you tried any really big models on a mac studio? I'm wondering what latency is like for big qwens if there's enough memory.

rushingcreek - 14 hours ago

Very cool, congrats! I'm curious how you were able to achieve this given Apple's many undocumented APIs. Does it use private Neural Engine APIs or fully public Metal APIs?

Either way, this is a tremendous achievement and it's extremely relevant in the OpenClaw world where I might not want to have sensitive information leave my computer.

mnafees - 9 hours ago

Seems like you are leaking an ElevenLabs API key in your web demo. The OpenAI completions endpoint also has the API key in the request header but that seems to already be revoked and is returning a 401.

shekhar101 - 11 hours ago

Tried this and really liking it so far. Question: is there diarization support in the TUI app or in any of the models MetalRT supports? Any plans to add it if it's not already supported?

shubham2802 - 11 hours ago

It also tries to do some memory management: it remembers previous context and has an auto-compact feature.

Additionally, there's a personality feature. Try it out!! Super fun :)

brian-armstrong - 7 hours ago

What kind of self-disrespecting dev is using macOS in TYOOL 2026?

Tacite - 16 hours ago

Doesn't work: "zsh: segmentation fault rcli"

tiku - 15 hours ago

Personally, I'm so disappointed with the state of local AI. Only old models run "decent", but decent is way too slow to be usable.

woadwarrior01 - 9 hours ago

> Apple M3 or later required. MetalRT uses Metal 3.1 GPU features available on M3, M3 Pro, M3 Max, M4, and later chips. M1/M2 support is coming soon. On M1/M2, RCLI automatically falls back to the open-source llama.cpp engine.

So, no support for M5 Neural Accelerators, eh? (Requires Metal 4) ¯\_(ツ)_/¯

alfanick - 16 hours ago

I'm not looking for STT->AI->TTS; I'm looking for a truly good voice-to-text experience on Linux (and others). Siri/iOS Dictation is truly good when it comes to understanding speech. Something at this level on Linux (and others) would be great. Yes, it might be always listening and maybe sending the data somewhere, but give me the UX: hidden latency, optimizing for the first characters recognized, a good (virtual) input device.

DetroitThrow - 16 hours ago

Wow, this is such a cool tool, and love the blog post. Latency is killer in the STT-LLM-TTS pipeline.

Before I install, is there any telemetry enabled here or is this entirely local by default?

RationPhantoms - 14 hours ago

This doesn't work with any of the methods I've tried.

jaimex2 - 5 hours ago

I don't have a Mac

computerex - 15 hours ago

Amazing, this is what I am trying to do with https://github.com/computerex/dlgo

tristor - 16 hours ago

> What would you build if on-device AI were genuinely as fast as cloud?

I think this has to be the future for AI tools to be truly useful. The things that are truly powerful are not general-purpose models that have to run in the cloud, but specialized models that can run locally and on constrained hardware, so they can be embedded.

I'd love to see this added in-path as an audio passthrough device, so you can add on-device native transcription to any application that handles audio, such as video conferencing applications.

jawns - 14 hours ago

Based on the demo video, the TTS sounds like it's 10 years out of date. I would not enjoy interacting with it.

focusgroup0 - 15 hours ago

The fact that Apple didn't ship this in the years after the Siri acquisition is an indictment of its product leadership.

j45 - 15 hours ago

"Apple M3 or later required. MetalRT uses Metal 3.1 GPU features available on M3, M3 Pro, M3 Max, M4, and later chips. M1/M2 support is coming soon. On M1/M2, RCLI automatically falls back to the open-source llama.cpp engine."

john_strinlai - 15 hours ago

i knew i recognized this name from somewhere.

they are a company that registers domains similar to their main one, and then uses those domains to spam people they scrape off of github without affecting their main domain reputation.

edit: here is the post https://news.ycombinator.com/item?id=47163885

----

edit2: it appears that RunAnywhere is getting damage-control help by dang or tom.

this comment, at this time, has 23 upvotes yet is below 2 grey comments (i.e. <=0 upvotes) that were posted at roughly the same time (1 before, 1 after) -- strong evidence of artificial ordering by the moderators. gross.

pzo - 14 hours ago

FWIW, only RCLI is MIT-licensed; their engine, MetalRT, is commercial. I'm not sure about the license of their models, but I guess those are also not MIT. So IMHO this repo is misleading.

Not sure why they decided to reinvent the wheel and write yet another ML engine (MetalRT), and a proprietary one at that. I would most likely bet on CoreML, since it has support for the ANE (Apple's NPU), or on MLX.

Other popular repos for such tasks I would recommend:

https://github.com/FluidInference/FluidAudio

https://github.com/DePasqualeOrg/mlx-swift-audio

https://github.com/Blaizzy/mlx-audio

https://github.com/k2-fsa/sherpa-onnx

7kmph - 8 hours ago

this is the company that cold-emailed many people using email addresses scraped from GitHub.

david_shaw - 15 hours ago

I think the title should read "RunAnywhere," not "RunAnwhere."

Imustaskforhelp - 16 hours ago

I am just going to link the stats of this Hacker News post[0] and let the public decide the rest. For context, this is the same company that was mentioned in a blow-up post 12 days ago, which got 600 upvotes and which they didn't respond to back then[1]. (I have found it hard for posts to get such a 2x factor within minutes of posting; that's just my personal observation. Usually one gets it after an hour or two or three.)

I was curious, so I did some more research into the company and found more shady stuff going on, like intentionally buying new domains a month earlier to send that spam, so that the mail reputation of their main website would not suffer. You can read my comment here[2].

Just to be on the safe side here, @dang (yes, pinging doesn't work, but still), can you give us some average stats on the people who upvoted this, and run an internal investigation into whether botting was done? I could be wrong about it, and I never mean to harm any company, but I can't in good faith understand this.

Some stats I would want are: average karma, words written, and creation date of the accounts that upvoted this post. I'd also like to know the conclusion of the internal investigation, if one takes place.

[There is a bit of a conflict of interest with this being a YC product, but I trust the Hacker News moderators and dang to do what's right.]

I am just skeptical, that's all, and this is my opinion. I just want to provide some historical context into this company and I hope that I am not extrapolating too much.

It's just really strange to me, that's all.

[0]: https://news.social-protocols.org/stats?id=47326101 (see the expected upvotes vs real upvotes and the context of this app and negative reception and everything combined)

[1]: Tell HN: YC companies scrape GitHub activity, send spam emails to users: https://news.ycombinator.com/item?id=47163885

[2]: https://news.ycombinator.com/reply?id=47165788

- 15 hours ago

[deleted]

samuel_grupa_ai - 13 hours ago

[flagged]

dsalzman - 15 hours ago

[flagged]

iharnoor - 15 hours ago

[flagged]

josuediaz - 15 hours ago

[flagged]

sidv1711_ - 10 hours ago

Let's goo!!