Local AI needs to be the norm
unix.foo
1888 points by cylo 6 days ago
They will be, and that moment is not that far off. The progression is already in place: at first only large data centers could run performant LLMs; we are now firmly in "a bunch of servers with a couple of H100s each" territory, slowly heading toward "128 GB VRAM on a MacBook Pro or a Strix Halo". Within the next year, the pattern of "expensive remote LLM for planning, local slow-but-faster-than-human LLM for execution" will become the norm for companies, slowly moving to "using a local LLM for everything is good enough". And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed. The question will be: how much of the current compute capacity craze will local hosting give the kiss of death to, and what will that mean for the market?
> They will be, and that moment is not that far off.
It's here, right now. I'm running quantized Qwen and Gemma on a decent but three-year-old gaming rig (think RTX 3080 12GB and 32 GB RAM). Yes, it's slow, and it has a small context window. But it can (given a proper harness) run through my trip photos and categorize them. It can OCR receipts and summarize spending. It can answer simple questions, analyze code and even write code when little context is required. I could probably get a half-decent autocomplete out of it if I bothered with VS Code integration. "128 GB VRAM on a MacBook Pro or a Strix Halo" is already a minimum viable setup for agentic coding, I think.
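As a rough illustration of the kind of setup involved (a minimal sketch using llama-cpp-python; the GGUF path, quant and settings are placeholder assumptions, not a recommendation):

# Minimal local-inference sketch with llama-cpp-python (pip install llama-cpp-python).
# The model path and parameters are placeholders; pick a quant small enough for your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen2.5-7b-instruct-q4_k_m.gguf",  # any ~5 GB quant fits a 12 GB card
    n_gpu_layers=-1,  # offload all layers to the GPU if they fit
    n_ctx=8192,       # modest context keeps the KV cache inside VRAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this receipt: milk 3.49, bread 2.99, eggs 4.29"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])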
> And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed.
Currently, it works exactly the other way around. The cloud versions are orders of magnitude cheaper than self-hosting, because sharing utilizes servers much more efficiently. A company can spend half a million bucks on a rig running GLM 5.1 and get data security, flexibility and freedom from censorship, but oh is it expensive compared to Anthropic's per-seat plans.
I'm sorry to spoil it for you, but a Perl script was able to do all of that like ... 10 years ago? The out-of-the-box Shotwell manages photos quite well without any intelligence. The problem, as people mentioned above, is SOTA models' cognitive and tooling abilities. Also, have you noticed how top-end Mac Studios got downgraded recently? They don't want you to have access to frontier models. And you will not have it. See Mythos as Exhibit A.
The Mac Studio's disappearance is related to the fact that people now want them for the purpose of running local models. Supply and demand. That plus Apple doesn't shift prices for released products, and it essentially became underpriced when large RAM quantities exploded in price. For the price of 512GB of RAM alone you could get an M3 Ultra with 512GB of unified memory in a nice, quiet, and power efficient package. With the RAM you still need to spend a few thousand more on CPU/GPU, power supplies, storage and case.
Also the fact that an M5 version will be coming, and they likely know they are going to sell out on day one (I expect we'll see a price correction from Apple for higher end configs of M5 studios, base price will probably stay the same), so they need to build up stock reserves.
512GB of ram with I think 600GB/s access. It’s the bandwidth that makes the studio killer for inference.
> The out-of-the-box Shotwell manages photos quite well without any intelligence.
This piqued my interest in how it does it, and after briefly checking the project it seems it only has two features for automatic photo categorization: 1) it can group photos by date, and 2) it has face detection and recognition that uses trained weights (so ML "intelligence").
Immich (server) also has a whole host of ML features for classification as well.
I got away from google images and upload to my own Immich instance.
I also use an open source camera app on fdroid to degoogle that whole path.
> They don't want you to have access to frontier models. And you will not have it. See Mythos as Exhibit A.
"They" fully well know that they current frontier model are maybe 6 month ahead of what people will have access to without their control. See Deepseek as Exibit B
The reasons you can't run these locally have more to do with the fact that those Mythos-sized models require extreme amounts of memory and processing power to run at acceptable speeds, and neither you nor I can afford to pay for the resources to run them locally. A big reason is that "running locally" means running on your own hardware, and for almost everyone that means hardware that will spend a big portion of its time just sleeping. Because data centers and providers have higher utilization rates, they can easily outpace you. That, and the fact that when they place an order it's usually for hundreds of thousands of units.
I am convinced the (mainly Chinese) open-weights models are the only reason OpenAI and Anthropic release at the pace they do. Without them on their heels, we would have seen a stagnant duopoly in terms of public releases.
That is why the huge lobby machine is grinding away to make those models illegal.
Although, I wonder how many orders of magnitude in affordability the utilization rate actually gets them. Realistically if you use a self-hosted LLM for your job, you might be using it, what, a solid 6 hours per day? Assuming you can keep it actually fed while working (so some agentic thing is probably necessary; I guess it will need to be more than VS Code autocomplete and responding to individual prompts). Anyway, that starts you out at 1/4 utilization, and a 4x price increase might be worth paying for privacy and stability (no sudden change in model behavior, no price changes, no days when the system is over-utilized for reasons outside your control).
Rather I think it is just hard for local LLMs to compete in this early stage when the cloud providers are allowed by investors to be unprofitable.
> Realistically if you use a self-hosted LLM for your job, you might be using it, what, a solid 6 hours per day?
You can grow the utilization rate well beyond that if you don't always care about getting a quick, real-time response. (And if you do, then maybe the cloud model was the better deal after all!)
Isn't Mythos that screw-up where Anthropic shipped something that was no better than the product OpenAI launched a few weeks later?
And, assuming the allegations are true, don't things like DeepSeek and Qwen offer existence proofs that frontier models can be (and will forever be) trivially distilled down to run domain-specific tasks on boxes that cost a few months of a Claude Max subscription?
>Also, have you noticed how top-end Mac Studios got downgraded recently? They don't want you to have access to frontier models. And you will not have it.
Isn't that a function of RAM supply not being available now?
OpenAI did buy out the RAM supply to block competition. Arguably local models are one of its (smaller) competitors.
Even if that weren't the case, every corp _needs_ you to be on a subscription.
They didn't really even buy the RAM. But there's pretty significant demand for RAM in general with data centers being planned left and right.
Do we even have decent OCR nowadays? Any free solutions?
The latest rounds of open-weights vision language models are incredibly good. Like, massively good. Open-weights vision capabilities trade blows with frontier models. Over the last few months I'd roughly rank capabilities as Gemini -> {ChatGPT and SOTA open-weights models} -> Claude.
qwen3.5-2b and qwen3.5-4b are great at document parsing. They can run on CPU
qwen3.6-27b and gemma4-31b are borderline better than the human eye in some cases. Their OCR isn't perfect, but they're seriously good. They can still run on the CPU but you'll be waiting minutes per document.
You can demand JSON, YAML, MD, or freeform text just by varying the prompt. Even if you have a custom template, you can just put that in the prompt and they'll do an OK-ish job.
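For instance, here's a rough sketch of that kind of structured extraction against a local server (it assumes a llama.cpp llama-server exposing its OpenAI-compatible endpoint on localhost:8080 with one of these vision models and its projector loaded; the JSON keys in the prompt are made up):

# Sketch: ask a local vision model for structured JSON from a scanned receipt.
# Assumes llama-server is serving an OpenAI-compatible endpoint on localhost:8080.
import base64, json, requests

with open("receipt.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract vendor, date and total as JSON with keys vendor, date, total. Reply with JSON only."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            ],
        }],
        "temperature": 0,
    },
)
print(json.loads(resp.json()["choices"][0]["message"]["content"]))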
There's also models that aren't in the r/locallama zeitgeist. IBM released a new 4b parameter model for structured text extraction last week, and there's a sea of recent chinese OCR models too.
IMO the open-weights models are so good that in a lot of cases it's not worth paying frontier labs for OCR purposes. The only barrier to entry is the effort of setting up a pipeline, and having the spare CPU/GPU capacity.
Many of the open-weights LLMs accept either text or images as input.
Besides those, there are a few smaller open-weights models that are dedicated for OCR tasks, for instance DeepSeek-OCR-2 and IBM granite-vision-4.1-4b. (They can be found on huggingface.co)
The dedicated vision models can be run on much cheaper hardware, including smartphones, than the big models that can process images besides text.
Similarly, besides bigger multimodal models that can accept audio, images or text as input, there are smaller open-weights models dedicated to speech recognition, e.g. Xiaomi MiMo-V2.5-ASR and IBM granite-speech-4.1-2b.
Depends on your use case. My production workload runs satisfactorily on a local docling-serve ( https://github.com/docling-project/docling-serve ), but that is mostly easy, relatively clean scans of decently typeset documents with some typical scanning artefacts.
The qwen models not only have good OCR, they will describe pictures to you.
Anyone wanna do a quick offline MVP on a general vision assistant for the blind? We've had things like Google Lens for a while, but it's a bit vision and touchscreen-centric.
APIs for Mythos and GPT Cyber are circulating on the market (that's also why we can use Claude and GPT in China). The open source community has been advancing subscription engineering for a long time, and I don't think Anthropic or OpenAI have any technical advantage in this field.
Huh? Why would Apple not want you to be able to run local models? They have very deliberately stayed the hell away from this space.
The conspiracy angle here is not really relevant. Ram is expensive and they're gearing up for M5 studios. Not the illuminati keeping better LLM models out of your hands.
They did decrease the memory bandwidth for... reasons... which didn't make much sense, but yeah, this is some pretty weird conspiracy stuff.
Apple doesn't even sell a model. They just have a deal to use Google's. They can't "protect" their cloud version of a model they don't have.
You think Apple doesn't want you to use local models?
That's an interesting way to view the world. I mean, utterly stupid as it is, but interesting.
But the previous sentence is even stupider (a Perl script 10 years ago could write code like Qwen does now?), so I guess at least it's consistent.
I built my own IDE and run my own model specifically to have private agentic coding. I can still access model APIs, but I can be purely local if I want to. It's amazing.
Curious, why did Zed with ACP not work for you?
Because I wanted the full IDE on my iPhone so I can code while away from my laptop doing fun stuff with my kids. And I don't like the Claude/Codex fire-and-forget approach.
The IDE I built has a full terminal, file system, git integration and an AI agent. It uses a private cloud Linux container that is persistent, so I can install packages and do anything I want from any phone, computer or browser. It's amazing that we live in a time where we can build custom software for ourselves just for fun. I will never have to worry about Cursor or VS Code changing, getting bought and mothballed like Atom (my favorite IDE). I now own my tool and will forever.
It will literally break overnight when some key dependency changes. Your LLM might not be able to fix it. Then I guess you regenerate it all from scratch? Sounds exhausting, tbh.
I’ve built enterprise software for 10 years with multiple upgrades over that time. With good test coverage and the right abstractions maintenance is feasible.
Also, because I wrote and own the code, I don't have to update if I don't want to. I could choose instead to build around the dependency. That's much more control than when Microsoft bought GitHub and destroyed the Atom IDE, which I loved, in favor of VS Code, which I still hate.
I'm just guessing, but an IDE that needs 3D acceleration just so a stupid UI can run "smoothly"? That's ridiculous.
Who runs an IDE with LLM agents accessing their local filesystem on bare metal?
Or am I the only one who runs everything LLM-related in a VM, just for development work? And because of Zed's genius decision, you need to share your GPU with the VM, and then some important features stop working, like snapshots. So you also need a workaround for that, etc.
Too much hassle; Zed is not for me.
But I'm anti-Apple, so maybe that's the reason :)
Btw, even the "ImHex" devs realized this, and they provide a version without acceleration for VM use. They're using ImGui. Using that for a local desktop app UI is also ridiculous, imho. Whatever.
I would imagine running a local LLM for development isn’t as popular as using a hosted provider. I don’t personally host a local model, but I have shared GPUs and storage volumes with VMs and I didn’t see it as that much of a hassle. What kinds of problems are you running into?
Doesn’t ghostty also use graphics acceleration? I was under the impression that rendering text is a relatively challenging graphics compute task.
I run local LLM on my MacBook together with frontier models for different tasks. I am in the process of setting up a 3 Mac studio system to serve AI to my team.
What's wrong with using a 3d accelerator and falling back to CPU graphics if needed? Pixels / joule is orders of magnitude better on an iGPU than on the CPU. (Which can matter over a 8-12 hour editing session, maybe.)
Modern IDEs don't use 3D at all, nor do they use the sprite-like 2D graphics that GPUs excel at and that can accelerate, e.g. mobile touch- and swipe-based UX. The main thing they do is font rendering, and accelerating that on GPU while keeping visual quality unchanged is quite complicated. The graphics pipeline doesn't really help all that much.
Agents are read-only by default in Zed. You should really get off your high horse.
Multiple gazillion-dollar companies each seem to be spending to ensure that they alone dominate pretty much all knowledge work, with customers eating up their tokens like Cookie Monster. I wonder if any of them could survive as LLM providers if they not only failed to do that, but the entire industry ended up selling what the current Cookie Monster would call a "sometimes snack," for very special occasions?
In my experience, once you get to ~30 GB of RAM for a model like Gemma4, the rest of the 128 GB of memory is simply nice to have. The speed and cost are what make it tough, though, because it's slower and more expensive than the same model served on a big accelerator card, and it's going to be worse than a frontier model.
I wonder if it really needs to be worse. I am playing with the idea of fine tuning a model on my exact stack and coding patterns. I suspect I could get better performance by training “taste” into a model rather than breadth.
I also wonder about JS only, Python only, etc models.
Maybe the future is a selection of local, specific stack trained models?
There is some recent work on modularizing knowledge in LLMs.
https://arxiv.org/html/2605.06663v1
It might be possible to train a big generalist that is a composition of modules, some of which can be dropped dynamically at inference time, depending on the prompt.
These models' ability to generalise at coding will likely get worse if you remove high-quality training data like all of Python.
That approach has its advantages, but sometimes I want to generate code for a language or kind of project I’m not experienced with using the accepted best practices.
Fine tuning these models (at least with PPO or equivalent) requires even more VRAM than inference does, potentially 2-3 times more.
You could use PEFT? Operating on only a subset of weights is fairly standard practice nowadays …
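Something like this, roughly (a minimal sketch with Hugging Face's peft library; the base model, rank and target modules are placeholder assumptions, not a recommendation):

# Minimal LoRA setup with peft: only small adapter matrices get trained,
# so memory use stays far below full fine-tuning. Names and ranks are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", device_map="auto")
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach adapters to the attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the total weights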
Yes, I used LoRA and it's fine, but I'm not convinced the model doesn't end up more stupid and less general.
You need the rest of the RAM for the context. If you don't want to end up with a toy context or a quantized, lossy context, it's pretty easy to end up spending 50+ GB just on the KV cache, per simultaneous inference slot.
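For a rough back-of-the-envelope (the dimensions below are made up but plausible for a large dense model; real models vary, and GQA/MLA change the math a lot):

# Rough KV-cache sizing: 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens.
layers, kv_heads, head_dim = 60, 8, 128   # illustrative, not any specific model
bytes_per_elem = 2                        # fp16 cache
context_tokens = 128_000

kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * context_tokens
print(kv_bytes / 1e9, "GB per inference slot")  # ~31 GB with these numbers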
Are there any harnesses that are attempting to optimize for using local models like this? Unsurprisingly, my naive attempts to integrate with harnesses designed for frontier models have gone poorly. But it seems like a harness that understands the capabilities and limitations better could perform significantly better.
>It's here, right now.
I mean, I've been forcing my good old 1080 Ti to run local models since shortly after LLaMA was first leaked.
But I wouldn't say "local models are here" in the same way as "year of the Linux desktop!111"
Until someone can just go out and buy some sort of "AI pod" that they can take home, plug in and hit one button on a mobile app to select a model (or even just hide models behind various personas) then I wouldn't say it's quite there yet.
It's important that the average consumer can do it. I think the limitations are: things are changing too quickly, RAM and compute components are exceedingly expensive right now, and we're still waiting on better controls/harnesses to stop consumers not just from shooting themselves in the foot, but from blowing their foot clean off.
It would be interesting to see a Taalas-like chip in a product, although there are so many changes going on at the moment with diffusion-based models and Google's Turboquant (which, as someone who has almost always had to run quantized models, makes a lot of sense to me).
What is the use case you see for non-technical users self-hosting? I think it’s important that tools remain available but I don’t expect it to be adopted by “average consumers.”
I’m interested in self-hosting for privacy and control. I already owned the hardware I’m testing with, so my spend is limited to time and electricity.
The “LLM pods” you describe will be loaded with spyware and adware (see: Smart TVs), and average consumers won’t max their compute around the clock so naturally data centers are able to make more efficient use of hardware by maximizing utilization.
Agree with your point on them being loaded up with spyware etc because that's just how it is now I suppose.
In terms of maximising compute I kind of agree but also kinda not - people's laptops and phones aren't burning at 100% 24/7 either. Sure AI requires so much more compute...but not _that_ much more, especially as technology marches on.
For the general use case; I could be wrong but I'd see it sort of like a GPU/NAS/etc. "Pay once" rather than a subscription (to a service offered by a datacenter).
But tbf, the way things are now _is_ all subscription models and consumers just kinda let it happen. I would love to be able to pay a one-off fee for lightroom...but I can't because they want a subscription to "pay for all the updating we're doing". They barely update shit.
And on top of that, I'm sure the "LLM pod" will still be sold on a subscription model so you get model updates etc.
But I wish we could actually have nice things. I imagine there's a niche for a middle ground: a privacy-preserving device that uses local-only models and doesn't spy on the user, and sells for a one-time payment with no subscription. It'll be expensive, though, likely more expensive than using a cloud-hosted model.
I need to see these proper harnesses
I tried oMLX and OpenCode a few weeks ago and the 65k context window was useless; it tried to analyze a very small codebase before going full-on agentic and ran out of context window immediately.
I don't have time to tweak 1,000 permutations of settings just to re-prove that it's not as smart as Opus 4.6.
I need out-of-the-box multimodal behavior as simple as typing claude in the command line, and it's so not there yet.
but I'm open to seeing what people's workflows are
I'm playing with a tape drive for backups, so I asked a local model to rewrite LTFS ( https://github.com/LinearTapeFileSystem/ltfs ) in Go.
I gave it the reference C implementation, the LTFS spec from SNIA, and asked it to use the C implementation to verify the correctness of the Go code.
LTFS is a pretty straightforward spec, so it made a very reasonable port within about 2 days. It's now working on implementing the iSCSI initiator (client) to speak with my tape drive directly, without involving the kernel.
Edit: the model is Qwen3.6-35B
I'm running opencode with qwen3.6-35b-a3b at a 3-bit quant. I also have qwen3.5-0.8b used for context compaction. I run with 128k context.
It's usable. I set it loose on the postgres codebase, told it to find or build a performance benchmark for the bloom filter index and then identify a performance improvement. It took a long time (overnight), but eventually presented an alternate hashing algorithm with experimental data on false positive rate, insertion speed and lookup speed. There wasn't a clear winner, but it was a reasonable find with rigorous data.
Do you encounter looping issues at such low quants? How do you deal with those?
Hey man, you can just say "I'm lazy, so I'm staying with the cloud. if I wanted to use my brain, I wouldn't be using AI, gosh" - it's much shorter.
all the money and clout is in considering people’s reported problems as valid and solving them
so when I encounter a common but invalidated friction, I explain it like I’m 5, understanding that many of the engineering and entrepreneurial problem solvers have the emotional intelligence of a 5 year old
Has anyone tried to calculate the break even cost of buying a PC to run an LLM locally, vs the amount of tokens you could get from an AI provider?
The basic answer: very much not worth it at face value, becomes arguably worth it once you start worrying about future rug pulls from the big AI providers. (And that does include the market for third-party inference, at least at present.) It's also worth it if you have existing hardware to repurpose, but that's obvious and not what you were asking about.
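A crude way to frame the break-even (every number below is a placeholder assumption, not a quote from any provider, and electricity is ignored, which only pushes break-even further out):

# Crude break-even sketch: hardware cost vs. buying the same number of tokens from an API.
hardware_cost = 6_000          # USD for a machine that runs your chosen model
local_tok_per_s = 30           # sustained generation speed on that machine
hours_per_day = 6
api_price_per_m_tok = 10.0     # USD per million output tokens

tokens_per_day = local_tok_per_s * 3600 * hours_per_day
api_cost_per_day = tokens_per_day / 1e6 * api_price_per_m_tok
print("days to break even:", hardware_cost / api_cost_per_day)   # ~925 days with these numbers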
Also, you can feed it ALL of your data willy-nilly without ever worrying about safety, because you can just do it with the LAN cable unplugged. For applications that demand data hygiene it's a cheat code that guarantees safety without any sort of data sanitization.
I run Gemma locally on a 3090, it's amazing how useful it is to be able to call out to ollama in a bash script or cron job.
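For anyone curious what that looks like, the local HTTP API works just as well as shelling out to the CLI; a sketch (assumes `ollama serve` is running on its default port and the model tag has already been pulled):

# Call a local ollama instance from a script or cron job via its HTTP API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:12b", "prompt": "One-line summary of: backups finished, 142 GB copied, 0 errors", "stream": False},
)
print(resp.json()["response"])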
Perhaps I am the odd one out here, but a small part of me wants to see what happens when you run a proprietary SOTA model on a laptop.
Currently I'm testing something like this just to see what happens. I have an old laptop with 4GB of RAM. I attached a USB drive with Gemma 4 31B model (which is 32.6 GB). Currently the laptop is running llama.cpp and trying to respond to a prompt by streaming the model from disk.
The USB drive light is flickering, showing something is happening. It's been about 8 hours since I entered the prompt and I've gotten about 10 tokens back so far. I'm going to leave it running overnight and see what happens.
Wow, that's a true worst case scenario especially if the USB is just plain old USB 2.0 (max 480 Mbps) and/or if the drive is a spinning disk. How's the CPU doing, though? Is there any headroom given the USB bottleneck?
Running top shows the process llama-cli taking 29% of CPU and 88% of memory, while the process usb-storage is taking 9% of CPU and 0% of memory.
Nice.
What did you use to do this, something standard like llama.cpp, or something else like vLLM or your own contraption?
llama.cpp
It's now spit out about 40 tokens after maybe 18 hours and has not finished the "thinking" stage of responding to the prompt. I'll let it keep running to see what happens
Not sure if this is exactly the scenario you envision, but I run ComfyUI on an Acer Helios 300 laptop from four years ago. It has 16GB RAM and an NVIDIA GeForce RTX 2060 with 6144MiB of VRAM, and I have generated a few images using the "NetaYumev35_pretrained_all_in_one.safetensors" checkpoint at 10.6GB (well beyond the 6GB capacity of the RTX 2060 card). That being said, it takes more than 10 minutes to complete the task. Of course, I have to turn off all other apps and browser tabs or hibernate them. If I don't, the laptop's fans begin to spin up like an airplane propeller. It's worth mentioning that I've tried to do this with other IDEs and all seem to fail with some error or another, usually an out-of-VRAM issue. I've only gotten it to work with ComfyUI.
I use an Anaconda environment (though I would have preferred a "uv" environment) on Linux, and automate the startup sequence with the following script (start_comfy.sh) from the terminal, rather than manually starting the environment from that same terminal:
#!/bin/bash
#
# temporary shell version
# make `conda activate` usable from a non-interactive script
eval "$(conda shell.bash hook)"
conda activate comfy-env
# launch ComfyUI via comfy-cli; --lowvram and --cpu-vae keep the 6GB RTX 2060 from running out of VRAM
comfy launch -- --lowvram --cpu-vae
Here are some of the images: https://imgbox.com/nqjYhdx3 https://imgbox.com/93vSWFic https://imgbox.com/qs1898dz
I'm hesitant to increase the sizes of the renders as that will surely stress my laptop's components.
I'm not running local for exactly that reason: to not stress my components. It seems we are in for a long haul with this AI bubble (can't wait for it to pop), so I need to make sure I survive this madness, because I definitely can't afford to replace anything right now.
I don't know that any AI bubble will pop. AI can be used to accelerate therapies, cures, and scientific advancement. Add to that quantum science technology which, if successful, should accelerate things further, depending on who's at the wheel. The problem is the gap between now and then (e.g. the age of abundance). It's going to be a difficult road for a good number of the population until that day comes. I'm scouting potential locations of bridges to live under, so that I can find and claim one when homeless day arrives.
I can't help but feel that companies using AI and engaging in employee layoffs are shooting themselves in the foot. The endgame for them will be zero profits, since displaced workers translate to no money to pay for goods and services :|
Both the bubble popping and its legitimate use cases can exist at the same time.
For example, the www bubble popped, but the Internet didn't go away
I'm using ROG Phantom laptop with Strix Halo iGPU that has a whopper of 128 GB VRAM. Next year there will be the rumored Medusa Halo with 256 GB VRAM, which is more than enough to run DeepSeek V4 Flash.
I don't think you're the odd one out. I would be very curious to try to run Opus 4.7 on a (high end) laptop. I'd also like to see how it runs on a high-end workstation rig built for it.
Nothing special?
I mean, the inference engine might need some tweaks to support whatever compute is available. But then, if you put in a few terabytes of disk for swap and swap the RAM for bigger sticks if possible, it should work? Slowly, of course, but there's no reason it shouldn't.
The big difference will be measuring seconds per token instead of tokens per second.
Can you share how you use it to categorize trip photos!
I'm not sure there's a one-stop shop for this at the moment. I think the process is:
* Have a box with sufficient spare (V)RAM -- probably 8G for simple categorization with qwen3.5-4b, and 24G or more for more intelligent categorization with qwen3.6-27b or gemma4-31b.
* Download or compile llama.cpp. Choose a model, then choose one of the "quantized" builds that will actually fit on your hardware. There are literally hundreds to thousands of these per model on Hugging Face.
* Spend half a day tuning command-line parameters until llama.cpp doesn't crash.
* Watch llama.cpp regularly OOM itself, then put it in a systemd service with a memory limit so it doesn't take the entire machine down when it dies.
* Download all your photos to a folder.
* Start vibing a Python script to categorize your images by repeatedly prompting the LLM with each image in turn (roughly the shape of the sketch after this list).
* Spend days tweaking and refining the prompt to try to get the LLM to actually do what you want.
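A minimal sketch of that loop, assuming a llama-server with a vision model and its projector on localhost:8080; the category list, folder layout and prompt are all placeholders:

# Rough shape of the categorization loop against a local OpenAI-compatible endpoint.
import base64, pathlib, shutil, requests

CATEGORIES = ["food", "landscape", "people", "documents", "other"]

def ask_category(path: pathlib.Path) -> str:
    img = base64.b64encode(path.read_bytes()).decode()
    r = requests.post("http://localhost:8080/v1/chat/completions", json={
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": f"Answer with exactly one word from {CATEGORIES}."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}},
        ]}],
        "temperature": 0,
    })
    word = r.json()["choices"][0]["message"]["content"].strip().lower()
    return word if word in CATEGORIES else "other"

for photo in pathlib.Path("photos").glob("*.jpg"):
    category = ask_category(photo)
    dest = pathlib.Path("sorted") / category
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(str(photo), str(dest / photo.name))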
The endgame is one of:
* The local model categorizes your images. Yay.
* The local model is too slow and you give up. Boo.
* The local model is too slow, so you spend $1k-$10k on hardware. Your image categorization task becomes a cover story for buying new gear. Yay.
* The local model can't understand your categorization metric, so you give up. Boo.
* You eagerly await news of the next open model being released. Yay?
* You consider replacing your local model with a frontier model, but then you realize you'd be spending $500 to categorize your photos. Boo.
* You refuse to allow Google/Gemini/Anthropic to train on your nudes. Boo.
This is one of the most popular options, self-hosted: https://immich.app/
This is my exact setup as well and dear lord gemma is absolutely batshit insane. I'm trying to get a self-reflection and confidence loop going now, but it does feel like it's not the local resources, it's the limits of the training. Dedicated coding or dedicated real-world task models would be a good optimisation.
You are greatly underestimating the hardware requirements for productive local LLMs. Research consistently shows that parameter count sets the practical ceiling for a model's reliability. Quantized models with double digit param counts will never be reliable enough to achieve results in the realm of something like Opus 4.6.
Flat wrong. Q6 Gemma 31b feels a lot like Opus 4.5 to me when run in a harness so it can retrieve information and ground itself. The gap is not that big for a lot of use cases. Qwen MoE is fast as fuck locally for things that are oneshottable. I have subscriptions to all the major providers right now, and since Gemma 4 and Qwen 3.6 came out I haven't hit limits a single time. I'm actually super surprised by the number of things I try with Gemma 4 with the intent of seeing how it fails, then having Claude do it, only to come away with something perfectly usable from the local model.
Your n=1 might not be very relevant outside your personal use. In less contaminated benchmarks Gemma 4 is way below Sonnet 4.5, let alone Opus models: https://swe-rebench.com/
Benchmarks only give you the roughest idea of how models compare in real-world use. They're essentially useless beyond maybe classifying models into a few buckets. The only way you gain an understanding of something as complex as how an LLM integrates with your workflow is by doing it and measuring across many trials. I've been running Opus 4.7 in Claude Code and Gemma 4 31b in parallel on projects for hours a day this past week. Opus 4.7 is definitely better, but for many things they are roughly equivalent; there are some things on the edge that are just up to chance, where either model may stumble across the solution, and there are some areas of my work that reliably trip up both models, where I get better mileage out of writing code the old-fashioned way. I understand that I'm just one data point, but I'm not writing CRUD apps here; I'm doing DSP and weird color math in shaders. I don't think any of it is hard, and the stuff that I think is hard none of the models are good at yet, but idk, they just don't seem that extremely disparate from one another.
FWIW I think Gemma 4 31b is more likely to be of use to me than Sonnet. idfk, maybe it's a skill issue, but I love Opus 4.7, undisputed king, while Sonnet seems borderline useless and I basically think of it as on the same level as Qwen 35b MoE.
"essentially useless" is a gross overstatement. Your personal benchmarks will always provide you with the most value, but disregarding standardized benchmarks because you care more about vibes is not exactly scientific.
Sorry, "essentially useless in the context of local model availability". It's a fine model but it's tier of inference is fully fungible.
I'm building a pipeline and testing against Gemma4 and Gemini's 3-1 Flash. Both are very good on certain tasks, and even n-way clustering works almost perfectly almost always.
But they diverge greatly on other particular tasks whenever the ViT tower and a priori knowledge of the world are crucial. I wish Gemma were on par, but both I and Google know it's not.
You do need to ask whether or not Sonnet or Opus are overkill for a lot of work though. If Gemma4 with some human effort can achieve the same result as Sonnet then it's arguably a lot more cost effective as you're paying for the person to operate each one regardless.
I 100% agree with your philosophy but I wanna note that I genuinely find Gemma 4 31b to be better than Sonnet. To be clear, this makes NO sense to me, so I'm probably just high and making stuff up or just biased by a small sample size since I don't use Sonnet that often. I find that Gemma 4 makes the sort of "dumb AI" mistakes Sonnet makes less often, especially in agentic mode. I genuinely don't know how that can be true but Sonnet feels much more like "autocomplete" and Gemma 4 feels like "some facsimile of thought".
I’m guessing Qwen3.6 for agentic coding and Gemma4 for non-coding stuff?
No, exactly the opposite, actually. Qwen3.6 is too imprecise for long-running agentic tasks. It doesn't have the same ability to check itself as Gemma does in my testing. I keep Qwen MoE in VRAM by default because there are tons of tasks I trust it to oneshot, and its 90 tok/sec is unparalleled; anything where I don't want to have to intervene too much, it can't be trusted with.
Oh interesting. I've read that Gemma 4 is really good for creative stuff, but I'm mostly interested in agentic coding. Unfortunately, each time I use Gemma 4, I just get it stuck in loops.
This is probably a precision thing, I think there's a really big difference in long running tasks between q4 and q6.
What harness are you using ?
I'm going to switch to local LLMs for most stuff soon.
Overall, using screen time as the metric (derived from some imperfect logging and vibes), it's about 50% OpenCode, 15% Continue, 15% my homebrew bullshit, 13% Claude Code and 7% Cline. I've been deep on agentic stuff lately (1.3 wks, aka 3 months of AI time); there are only so many hours in the day to duplicate work and A/B test, but in the past I've sworn by Qwen Coder + llama.vim, and I still enjoy that workflow for deep work far more than I like prompting agents, but there's a lot of dross I'm learning to delegate.
Interesting.
I stopped doing local stuff for a bit when I realised I didn't know how well it was supposed to work, so I've been on Claude for a few months now.
I think I'll try OpenCode this time.
Usually I do stuff in devcontainers; qwen code (non-local) was the only time I managed to lose some work, as it got confused when I ran out of tokens.
There's still quite a way to go - it does seem like Claude Code itself is pretty badly coded, so I think there is space for open source to come in with a high-quality harness at some point.
Sorry but you're just seeing what you want to see. The idea that a 31b model is anywhere even in the ballpark of something like Opus 4.5 is just absurd on its face.
False. The absolute capability is irrelevant; with the proper harness, 31b is more than adequate for a very large portion of the tasks I ask AI to do. The metric isn't how good the model is at Erdos problems, it's how reliably it can remove drudgery from my life. It just autonomously reverse-engineered a Bluetooth protocol with minimal intervention; its ability to react to data and ground itself constantly impresses me. I do a ton of testing with these models; today I had Gemma answer a physics problem that Opus 4.7 gave up on. With a decent harness and context, the set of tasks where their capabilities are both good enough is surprisingly large. The tasks I have that stump Gemma often also stump Opus 4.7.
Maybe reaching for an analogy would be helpful here.
Thot_experiment is saying that his 2016 Toyota Prius is a great and reliable car for his daily commute and running errands.
Whereas everyone is screeching about its capability gap with a Lockheed Martin F-35 Lightning.
Yeah, thanks, though I think local models are at least a Cessna, which while being nothing like an F-35 can fly.
Flying is fun. But shooting Cessnas out of the sky is more fun!
I'm kidding around. I run 31b models myself too and am perfectly happy with them.
This is like saying that 640kB is enough for anybody.
No, it isn't. I am saying that the set of tasks that can be completed by Opus 4.7 has a surprisingly large overlap with the set of tasks that can be completed by Gemma 31B. It is meaningfully equivalent in many cases.
(Of course, if I'm being honest, 640kB is fine; I'm sure tons of the world's commerce is handled by less, for example. The delta between a system with 640kB of RAM and a modern one is near nil for many people; the UX on a PoS terminal does not require more than that, for example, and the Hacker News UX could also be roughly the same.)
> 640kB is fine
How refreshing to hear this kind of old-school hacker thinking, in a thread where most people have given up on local computing in exchange for convenience and permanent third-party dependency.
With embedded systems affordable and ubiquitous, hopefully a growing segment of the new generation will also learn to push the limit of available hardware and see how far we can take it. As an engineer there's a satisfaction in solving things with what you got.
There's a new technique, a 1-bit family of language models, that can achieve up to 9x memory efficiency compared to existing models. Still multiple gigabytes for practical use, I imagine, but it's great progress toward local AI, which I believe will be common in the near future. https://prismml.com/news/ternary-bonsai
It would be true if model providers did not throttle their models. I do not have definitive proof that they do, but the rumors are abundant.
I think you are missing the point here: what matters is that, for that user, the local models are good enough for their use case.
Joke's on you. We are already running DeepSeekV4Flash, Mimo2.5, MiniMax2.7, Qwen3-397B locally on very affordable hardware. These models are in the realm of Opus4.6. For those of us a bit crazy, we are running KimiK2.6, GLM5.1 and more ...
I have two A100s and have been playing with local models for years. There's definitely moments where they are quite impressive, but small context sizes and unreliability become immediately obvious.
> For those of us a bit crazy, we are running KimiK2.6, GLM5.1
Yes, those can compare to Opus, but you can't run those unquantized for less than $400k in hardware.
Two Mac Studio M3 Ultra 512GB and 1 USB cable can run all those models - maybe about $30,000 in hardware - and based on my benchmarks, those Mac Studios were twice as fast as the A100s on Deepseek v4 Flash, which has a quantization but not really a lossy one.
That cannot run KimiK2.6 or GLM5.1, i.e. models within the ballpark of anything offered by frontier companies.
I run KimiK2.6 and GLM5.1 on a less-than-$10,000 system. Granted, I started putting my system together 2 years ago when things were much cheaper. I run DeepSeekV4Flash with 1 million context locally.
Yes it can, but the experience is not great.
A single maxed-out M3 can run a Q2 Kimi 2.6, though that's with heavily degraded perplexity.
2x M3s with RDMA can run a lossless Kimi 2.6 at Q4, but with CPU only you would get okay-ish decode and horrible (1 minute+) TTFT, which wouldn't be a great _interactive_ experience.
They all still fall short of Opus 4.6, definitely. They are good but fail on extremely complex tasks, in contrast with a frontier model that will keep trying until it succeeds or exhausts the solution space.
Not by much, and moving goalposts makes for a bad comparison. Local open weight models are already more powerful than frontier models from only a year back.
If you believe what you read here, the gap is closing fast.
Frontier models don't keep trying until they succeed. That's a harness problem, and best believe it, the best harnesses are private, not public.
It is much more of a context window size and model capabilities problem. Local models are not even remotely close in solving complex problems, even when used with the same harness.
Won’t these H100s drop in price in a few years? With the data center build-out, surely these will become 1/10th the price and you’ll be able to set up a local LLM as good as Opus 4.7. Even if the frontier models become more advanced and memory-hungry, you could use the same power as your oven to run a current-day frontier model as needed. If I could drop $10,000 to have an effectively permanent opus 4.7 subscription today, I would.
> Won’t these H100s drop in price in a few years
Doubtful. The increase in demand is greatly outpacing supply, and all signs point to a continued acceleration in demand
> If I could drop $10,000 to have an effectively permanent opus 4.7 subscription today, I would.
lol well obviously, but realistically that price point is going to be closer to $100k, with a perpetual $1k a month in power costs.
We tend to overestimate the short-term change, while underestimating the long term impact. A lot of hot air will likely vent when businesses realize LLMs didn't magically replace their workforce. Also, prices will go through the roof when energy production inevitably fails to keep up with demand for compute. Also, Moore's law more or less predicts we'll have today's technology in our phones in less than a decade.
I predict the B200 data centers we're building today will be obsolete in 3 years, and we'll be using models and hardware that aren't even on a road map today. Likely not NVIDIA, likely not OpenAI or Anthropic. Maybe Chinese?
In the meantime, we must continue building software with the clumsy coding agents tied to cloud services, as this (for now) seems to be about the only area where AI economically makes sense.
Why? These models are going to keep drastically improving and given all the new data centers token prices will probably drop a lot in the future. Seems shortsighted given the absurd timelines these things have been improving on.
Cool, thanks for the information. I guess they drive prices down by massively parallelizing requests on, say, an 8x H100 array? So the cost is spread across users. So if I wanted to use it for 8 hours a day in my theoretical world, it'd be too expensive. My work definitely wouldn't pay $100,000 for a server farm even if it gave an AI to all our employees; you'd have to have engineers, a colocation space, basically all the problems that companies didn't like and went to AWS to avoid.
Well $100k was a generous guesstimate for some time in the future where something like an Opus 4.7 is old news.
If we think about the near future, something like Kimi2.6 is within the realm of Opus 4.6 today, but requires closer to $700k in hardware to run.
Kimi 2.6 is very close to the Opus family in my experience. Also, it absolutely does not require $700k to run locally in an interactive fashion. We are talking more in the range of $10k for a slow Q2 with degraded perplexity, up to ~$35k for an acceptably fast 200k-context Q4 (quasi-lossless perplexity).
Opus 4.7-caliber models are trillions of params, and a single instance would likely run on multiple H200s. $100k of hardware. Not coming to your laptop anytime soon.
Yes and no.
The best analogy is the difference between having N senior level engineers working for you, versus having N entry level engineers.
With frontier cloud models, you can give a single invocation one task, and it can figure everything out.
With local models, you have to manage the inputs and outputs quite a bit more, but you can achieve similar results for tasks you set up harnesses for. They are not as good at finding the right answer internally from their own weights, but they are very capable of ingesting context and reformatting text - for example, local models can debug issues quite well if you give them the error and the documentation for the particular feature you are trying to implement.
Parameter size gets you world knowledge and better persistence of behavior as context grows. Both of those things can be engineered around to a large degree, and the latest Qwen models show that small models can be quite smart in narrow domains and short time windows.
… maybe we should just teach models how to get their world knowledge from a local Postgres connection! Then the model can be tiny, it can query to its little heart's content, AND it can run on commodity hardware TODAY!
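A sketch of what that could look like with plain tool calling against a local OpenAI-compatible server (the endpoint, model name, table and tool are all made up for illustration, and it assumes the model actually decides to call the tool):

# Hypothetical sketch: expose a Postgres lookup as a tool for a small local model.
import json, psycopg2
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_facts",
        "description": "Search a local knowledge table and return matching rows.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    },
}]

def lookup_facts(query: str) -> str:
    with psycopg2.connect("dbname=knowledge") as conn, conn.cursor() as cur:
        cur.execute("SELECT body FROM facts WHERE body ILIKE %s LIMIT 5", (f"%{query}%",))
        return "\n".join(row[0] for row in cur.fetchall())

messages = [{"role": "user", "content": "When was the LTFS spec first published?"}]
first = client.chat.completions.create(model="local", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]          # assumes the model chose to call the tool
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id,
                 "content": lookup_facts(**json.loads(call.function.arguments))})
final = client.chat.completions.create(model="local", messages=messages, tools=tools)
print(final.choices[0].message.content)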
It depends on what you mean by 'productive'. The article mainly seems to be targeting consumer-level hardware, such as the Neural Processing Unit you need for a 'Copilot PC'. Windows Recall is (was?) one such local AI application. If Microsoft get their way and my next PC has one, I look forward to using it for 'productive' purposes such as playing games and handling natural language stuff, leaving my GPU free for GPUing.
> You are greatly underestimating the current hardware requirements for productive local LLMs.
Fixed that for you. Right now most models produced are based on floating-point math and probabilities, which is "expensive" to compute.
Microsoft has researched 1-bit LLMs, which can run much more efficiently and on much cheaper hardware[1].
If this research is reproducible and reusable outside their research models, the cost of running self-hosted LLMs will drop by an order of magnitude once it hits mainstream.
I would argue we don't need anything near Opus to be productive. Sonnet is plenty productive enough.
I use Opus 4.6 as an example because it's the LLM that has been widely recognized by the public as reliably capable of doing real work across many domains. However, the same logic applies to Opus 4.5 and even previous generations. These models have huge parameter counts and large context sizes; there's no training technique that can compensate for those qualities in small and quantized models.
> we don't need anything near Opus to be productive. Sonnet is plenty productive enough
For niche applications, sure. For general use, I think the tendency towards the best model being used for everything will–to the model publishers' delight–continue. It's just much easier to get a feel for Opus and then do everything with it, versus switch back and forth and keep track of how Haiku came up with novel ways to dumbfuck this Sunday evening.
I think it's inevitable that access to good enough LLM models will be democratised.
However that's not the real battle here. The real battle is control of information to operate over.
While I might have access to a decent model - I don't have the huge integrated databases of everything that companies like Google have, and increasingly governments will accumulate.
As a citizen, AI operating over these large datasets is where the concern should be.
How fast do you reckon most people will be able to afford 128-256GB of RAM?
Other than this recent spike, it's been trending cheaper continuously for decades. In a few years 128GB will be as affordable as 12GB (what flagship phones have now) is today.
I'm sure it will happen but I don't think it will be soon.
10 years ago I was using 16GB in my MBP and today it's 48GB. That's just a 3x increase, during what was mostly a bonanza period.
For most of that time, I don't think many people had much use for more ram than that. If demand picks up, companies will provide it.
And the Mac Studio was available with 512GB until ram got scarce and they cut the max in half recently.
The Mac Studio is a high end computer that the majority can't afford or justify its expense.
There's plenty of demand for RAM right now. We'll see how this turns out.
IMO that was a really weird choice that everyone seemed to make. DDR5 2x64GB before the spike was like $250. I had not much justification to NOT go with 64GB for my pre-COVID build.
It seems that a lot of PC-building people are too deeply confused by Intel marketing and fixated on getting the flashiest CPU attainable within budget. Similar things happened with previous AI hype, when some people were using HDD boot drives on GPU rigs and asking others whether a low-end i7 would cut it. They acted very confused when told that they need an SSD and Pentium is plentium.
I mean, there is a shortage going on, but it'll be over anyhow - whether due to all of the last three standing filing for bankruptcy, or CXMT-Huawei starting to deliver in shiploads, or Kioxia entering the market - and when it comes back down to $2/GB, or even $5/GB, just max it out and forget about it for 10 years. Why not?
Nope.
Because late stage capitalism demands endless growth in order to pay executives and shareholders (especially those late to the train) more and more YoY.
And those requirements for growth mean that cost cutting is needed. Over the past few decades cost _have_ been cut, building things more efficiently, components becoming cheaper, larger volumes in mass manufacturing.
But we have already reached a point where there are no other places to cut than the quality of the product itself. Look to shrinkflation in food and other places - look at how "live action" versions are being made of previously animated movies, how game franchises from 2 decades ago are being brought back from the dead, the huge influx of remasters etc.
Why? Because it's cheaper to revive/reuse an existing IP than it is to create a new one + it guarantees success with the drooling consumer masses. And cheaper = more Ferraris for the multi millionaire/billionaire execs.
See how much Mario movie made? Just wait...bet you there'll be a live action version. ;)
Their prices are currently so unreachable because the big players are hoarding every chip they can get their hands on, but if/when the market realizes that locally deployed LLMs are the way to go, maybe (hopefully?) more chips will become available to consumers at lower prices.
The only way that'll happen is if deep-pocketed corporate buyers exit the market almost entirely, and therefore stop being the highest-available bidder. Even in a scenario where it's obvious to everyone that consumer-side hardware is a viable option, it's still not in the big AI providers' interest to abandon the effort to push/pull everyone to their cloud. They'll keep buying as long as there's liquidity to fund them and the will to do so, and we're a ways off that collapsing. I'm quite pessimistic. Prices will probably come down in the next 12-18 months, but not to where they were before this
Do you think small models will arrive? I mean, if I need to write a web application in TypeScript, why should I use a model that knows all the programming languages and is able to reply to questions about almost everything? I just need a small, performant model that knows how to write web applications in TypeScript. That could be very helpful and easy to run on my laptop.
For the same reason that a human who is fluent in five languages can probably express themselves better in any one of them than a human who only speaks one, while also having a more nuanced understanding of grammar in general. From what I know, learning on a more diverse set makes a model better overall.
This might be an interesting research question: can you train a model on many languages, and then extract a much smaller model that knows only one language without much loss of quality?
Humans brains and LLMs are not the same, though. I don't think your analogy is remotely applicable, even if your conclusion may be correct.
Depending on your laptop - if it's a Strix Halo or a MacBook with a decent amount of RAM - that day arrived about 6 months ago, and today, if you can run Gemma 31b, you're golden for your basic workslop code. You can do most of it with local models. Heck, for a lot of the tier of programming you might encounter in the average job, Qwen 35b MoE is good enough, and it can hit 100 tok/s on decent hardware.
> The question will be: how much of the current compute capacity craze will local hosting give the kiss of death to, and what will that mean for the market?
This will depend on how much inference happens for consumer (desktop, local) vs enterprise ("cloud"), vs consumer mobile (probably also cloud).
I would assume that the proportion of "consumer, local" is small relative to enterprise and mobile.
I think the proportion is small because someone has to pay for the cloud services. When phones, PCs and desktops ship with NPUs, whole new markets open up for all that stuff people want, but not enough to pay for.
The biggest impact of local models may simply be that they prevent remote inference from becoming the only game in town
> how much of the current compute capacity craze will local hosting give the kiss of death to, and what will that mean for the market?
Nvidia and other hardware sellers would love if they could sell a bunch of chips to individual consumers that would sit idle for 95% of its life.
Certainly, I don't think data centers are the way here.
I guess it'll most likely be local AI processing, with everything else becoming an API.
In the case of the GPTs and Claudes of the world, they'll just be using indexing APIs and a KB on top of their LLMs.
This is simply delusional. It costs 20-30k a month to run Kimi 2.6. The tokens are sold for $3 per million.
To sell tokens profitably you'd need to be able to run inference at 150 tokens per second for less than $1,000 USD a month.
I don't think people realize how expensive it is to host decently capable models and how much their use of capable models is subsidized.
You can only squeeze so many parameters onto consumer-grade hardware (that's actually affordable - two 4090s is not consumer grade, and neither is a 128GB MacBook; this is incredibly expensive for the average person), and the models you can still run are not "good enough"; they are still essentially useless.
People are betting their competency on a future where billionaires are forever generous, subsidizing inference at a 10-to-1 or 20-to-1 loss ratio. Guess what: that WILL end, and probably soon. This idea that companies can afford to give you access to $2MM in GPUs for 5 hours a day at a rate of $200.00 a month is simply unsustainable.
Right now they are trying to get you hooked. DON'T FALL FOR IT. Study, work hard, sweat, and you'll reap the benefits. The guy making handmade watches, one a month, in Switzerland makes a whole lot more than the guy running a manufacturing line making 50k in China. Just write your own fkin code, people.
Don't bet your future on having access to some billionaire's thinking machine. Intelligence, knowledge and competency isn't fungible, the llm hype is a lie to convince you that it is.
No one runs SOTA models 24/7 for individual use or even for a single household or small business, whereas you can run your own hardware basically 24/7 for AI inference.
With the new DeepSeek V4 series and its uniquely memory-light KV cache you can even extend this to parallel inference in order to hide memory bandwidth bottlenecks and increase compute intensity.
This is perhaps not so useful on a 128GB or 96GB RAM Apple Silicon device (I've seen recent reports of DS4 runs with even one agent flow hitting serious thermal and power limits on these devices, so increasing compute intensity will probably not be helpful there) but it will become useful with 64GB devices or lower that have to stream from a slow disk, or with things like the DGX Spark or to a lesser extent Strix Halo, that greatly overprovision compute while being bottlenecked on memory bandwidth.
deepseek v4 flash on mlx at 1m context runs at 20 t/s decode on a mac studio m3 ultra with 512gb of RAM
What is everyone running DeepSeek v4 Flash with?!
It's currently unsupported in llama.cpp, and vLLM doesn't support GPU+CPU MoE, so unless all of you have an array of DGX Sparks in your bedroom, what's the secret sauce?!
you can run it today with mlx if you have 256g or 512g mac studio. no "antirez" fork needed.
it isn't that large of a model and the compressed kv implementation is not that complicated
the problem is that they released the model in a quantized format that is more complex than it appears, and people make a lot of mistakes working with it. it is quantization-aware-trained, so you can't "just" upscale it and scale down.
vllm runs dsv4 flash fine right now
dgx sparks cannot really run it correctly right now with released vllm, but there are PRs; it's just a matter of time. you would need 3 of them. they will still be almost 1/2 as fast as the mac studio.
so the punchline is, well, this is why the 512g mac studio is such a hot commodity right now.
Unfortunately I didn't get a Mac with big RAM back when it was cheap, and I'd personally rather focus on moving away from Apple and going Linux full-time at work and home (currently a MacBook for a laptop, connected to my big rig - well, it's not that big compared to the AI people in here).
What kind of RAM does your MacBook have? It might still be worth experimenting w/ DS4 using disk offload, though it would be dog slow at best and the RAM would be much too limited for meaningful parallelism, especially for larger contexts.
This might be my only hope until RAM prices come down to human levels again
If you have a 256 GB or 512 GB Mac Studio, the real game is to run multiple sessions in parallel in order to make the best use of your limited memory bandwidth. You'd have plenty of excess RAM for that given how small the KV cache is even at max context.
https://www.github.com/antirez/ds4 (from Antirez of Redis fame) runs a 2-bit quant on Apple Silicon hardware and 96GB or 128GB RAM.
I've been keeping an eye on Antirez's Metal fork for llama.cpp, but I totally missed this. Whoa, nice. Giving it a go, thanks!!
What kind of hardware are you planning to run this on? As mentioned already, I've been trying to understand how gracefully it might degrade on 64GB RAM or perhaps lower (the total weights size is 80GB at the provided quant) using SSD offload for the weights, and then (assuming it works and doesn't just OOM) whether the tok/s figures might meaningfully improve in that scenario by running multiple sessions in parallel.
I've got a 4060 Ti 12GB with 128GB RAM. I was hoping that once I could demonstrate to myself that I could run DeepSeek v4 Flash locally (even at really slow speeds), it would be worth my time and money to get something that runs it at > 20 t/s.
... currently testing out Stepfun 3.5 Flash Q4_K_M as a stopgap (unless it blows my socks off first).