Apple Core AI Framework
developer.apple.com190 points by hmokiguess 7 hours ago
190 points by hmokiguess 7 hours ago
This is why the AI companies are rushing to IPO. By the end of next year you’ll be running most of your AI on device. They have no moat, they’ve reached the limits of scaling, most of the magic can be distilled into smaller models, and they know it
Qwen's ~30B-class models are genuinely good enough for use if you can find a machine with enough memory bandwidth to run them at 30-90 tokens/second. It's been extremely telling that Qwen stopped releasing 120b class models. At some point in the next 10 years (maybe 3?) someone is going to release an Opus 4.5 class 256B model you can run locally. Right now our engineers use about $800/mo worth of opus tokens; at that rate the ROI for local LLM is ~10 months
Didn't Qwen stop releasing their more powerful models because they're commercializing them?
Have we reached the limits of scaling? Sadly it appears that larger model still equals better model
Well, let's not forget that text models are not the only models! Video models are much slower and need comparatively more resources, and all they can do even at that size is generate videos a few seconds long. Clearly a ton more work is going to go into those, and demand for them will probably increase as more creative tools get authored using them as a central part of the workflow. Low-res local rendering for preview might be a thing, but the lion's share of the work for high-res, near-realtime rendering is going to be done on huge clusters for a long time yet.
I think there’s still an open question around are the ultra-large next-gen models worth it? For those of us without early access to Mythos, it’s hard to verify whether it’s been held back from the public due to actually being “too dangerously powerful to release yet” as implied or because the gains aren’t outpacing the costs.
I think GPT 4.5 showed that there is indeed a practical limit we're close too. That was supposedly a high-trillions of parameter model that was deprecated almost immediately because it was slow, insanely expensive, and had questionable benefits over the smaller models. Though apparently the new Mythos and whatever GPT Spud is (if it wasn't 5.5) are back up in the high trillions.
Actually having used it a bit, I'm quite excited to see a modern model of similar size.
I think what people didn't realize was, just because the GPT-4.5 model didn't get better on the benchmarks, didn't mean the model wasn't different than the earlier models. It was being compared to thinking models that were being developed at the same time.
The GPT 4.5 model still has some of the most "human" like abilities in communication even though it isn't particularly good a problem solving. It hadn't under gone the same type of reinforcement training.
I still use GPT 4.5 sometimes, in creative exercises it can be surprisingly effective. The model is still available.
> By the end of next year you’ll be running most of your AI on device.
I expect I'll probably keep paying for whatever badass high IQ model is running on inference servers at that point
I just want a tiny tiny model that runs on device that knows for autocomplete that, for example, I want to say "I'll be right back" instead of "I'll be right Brian". That's my #1 AI ask right now. Please, Apple.
I want Siri to let me “add to my calendar, dinner Peter’s house Sunday at 5pm” and not assume the location is the restaurant called Peter’s House in another state. It’s astounding how poor Siri is at using the data I’ve given it access to
Very false.
I use small models exclusively. They aren't a replacement for large models. You need decent hardware to run those models efficiently, as smaller parameter models plain suck and are still slow on macbooks. And affordability of higher end hardware is very limited.
Even at non VC subsidized $/token prices, its still much cheaper to run cloud based models.
> Even at non VC subsidized $/token prices, its still much cheaper to run cloud based models.
On a price-per-wattage level, this is not true, people have done the math on /r/LocalLLaMA many times over[1]. Local models, while not as good as premier models (GPT 5.5, etc.), are like ~80%+ of the way there, and often converge to a similar solution after a few dead ends.
[1] https://www.reddit.com/r/LocalLLM/comments/1kshq4f/electrici...
Maybe not per watt, but unless you already happen to own a 3900 cited by that post, you'd have to buy that as well, which is currently selling for around $1400 used.
3090s are running $1400 now? Wowsers. I thought I was overspending when I bought 6x of them for around $800 a pop.
Might be time to sell, to be honest. It's fun to have that at home, but I can't justify having $10k (with memory, mobo, cpu, etc) sitting in my basement without being fully utilized.
I do have a 3090 Ti on my gaming PC, but even my old M1 MBP (with a mere 32gb of RAM) is quite competent and can run a quantized `Gemma4-26B-A4B` in the background while I do other stuff.
well to be fair that's right now, I think the question is what about in 6 months, 12 months, 2 years?
Where do these improvement curves go? Does the gap close, do they intersect for practical purposes (factoring in cost etc)? Or is the local curve always just a translation of the hosted, lagging behind, or indeed does hosted just pull ahead?
Nobody knows, but it's a very open question I feel, and it certainly appears like the answer might quite reasonably be that yes they intersect on that kind of short-ish term time horizon.
>Where do these improvement curves go?
Nowhere.
Large models haven't seen that much improvement, just small unique tasks performance which is all special cased RLed to game metrics
For local models, its the same story. You can download Gemma 3 QAT from last year, and it will be just as good as Gemma:31b on the average. Qwen also boasts that its better, because again, they RLed it to game some metrics. Its better in coding then Gemma, but Gemma is better in more creative thinking (again, all RL)
Fundamentally, you need detail in the gradients for the models to pick up on the smaller details. If you don't have those, your output is gonna suck. No amount of clever architecture is going to fix this.
The only way to improve local models by training them to fetch context, and then their job becomes much simpler because all they need to do is reinterpret the fetched content and provide an answer. But fundamentally, if you are trying to keep things in house for advertising purposes like what all companies do with search, you want them to go to your service, which means running on your servers. And its not really that much extra per invocation (i.e excluding initial hardware costs) to instead just offer a large model as a service, which will be way better than any small models.
i am more excited about the ondevice foundation model update that is coming https://developer.apple.com/documentation/updates/foundation... (not much info yet)
but i maintain https://github.com/Arthur-Ficial/apfel so i might be biased
Have you seen that they've added an `fm` tool? It was mentioned in the Platforms State of the Union.
Thanks for building this! Something I grab on a regular basis, especially for doing simple education of folks about the basics of using LLMs by showing something that's not just a chatbot.
Apfel is very useful, thanks for the effort.
I second this, I’m more excited about dumb local models than something I could never run locally.
WWDC 2026 Core AI videos
Meet Core AI - https://developer.apple.com/videos/play/wwdc2026/324/
Dive into Core AI model authoring and optimization - https://developer.apple.com/videos/play/wwdc2026/325/
Integrate on-device AI models into your app using Core AI - https://developer.apple.com/videos/play/wwdc2026/326/
Wow, this seems to be a new way to convert PyTorch models to a format that runs across CPU, GPU & Apple's Neural Engine (ANE). [0]
Does this completely replace the previous API, CoreML? [1]
[0]: https://apple.github.io/coreai-optimization/
[1]: https://developer.apple.com/documentation/coreml/Yes. From the CoreAI docs:
"If your app uses model types other than neural networks, such as decision trees or tabular feature engineering, see Core ML."
This is just a bit exciting, although I wonder how the performance of this will stack up next to the stuff we already do with, e.g., a metal-optimised model which we then load into llama-cpp or whatever. (unsloth is a good example of doing this for you "batteries included").
seems they planning to replace it but overall now I'm really confused about this and mlx and coremltools. They should do better work explaining the benefits (and cons) of it and any feature parity between coreai, coreml and mlx.
My reading of it is:
- Core ML is for models designed only for Apple platforms
- MLX is for models that don't need to be fast
- Core AI is for models that run everywhere already and also need to be fast
Do we know what is the underlying model? Is it a custome model developed by Apple or one of gemma/deepseeks under the hood
AI future is clearly local, and my recent pitch has been "infinite tokens." Because that's what my M1 MBP can do; and that's what my RTX3090 can do. I don't need to pay hundreds of dollars a month and no one else does either.
Where 'infinite' has the unusual meaning of 'strictly fewer than available from cloud services in the same amount of time or dollars', at least for the medium term.
You already paid for the 3090, you pay roughly proportional to token count for the power it consumes, it produces them slowly, and there's a discounted value of getting a token later rather than sooner.
A differentiating value prop of local models is privacy, which may be forced on you by regulation (eg. other people's medical data), or by your preferences. But infinite tokens it ain't.
Is there something like this on Linux? For example, if I’m an application developer can I assume GNU Core AI (or whatever it is or would be called) will be there if the kernel is >= some particular version?
On non-Apple platforms, you generally have at least 2+(number of supported silicon vendors) different AI frameworks to worry about. I guess Apple's there now too, between Core ML, MLX, Core AI.
I haven't seen any sign that the framework fragmentation problem is going away anytime soon. NVIDIA wants everyone to do all training and inference with CUDA and to deny that NPUs have any usefulness. Everybody making an NPU has a different framework tailored to their architecture and the limitations they inherited from hardware designed before LLMs existed, and most of them have a another framework for targeting a GPU. And the OS vendor has one or two frameworks they would prefer you use rather than something hardware-specific.
For practical purposes llama.cpp is this. You can link to it or use the network API.