Ternary Bonsai: Top Intelligence at 1.58 Bits

129 points by nnx 3 days ago

Open access for the next 5 hours (Ternary-Bonsai-8B-Q2_0.gguf, running on an RTX 3090), or until the server crashes or this spot instance gets taken away :) =>

https://uklkyvetsjf7qt-80.proxy.runpod.net

    ./build/bin/llama-server \
     -m ../Ternary-Bonsai-8B-Q2_0.gguf \
     -ngl 999 \
     --flash-attn on \
     --host 0.0.0.0 \
     --port 80 \
     --ctx-size 65500 \
     --batch-size 512 \
     --ubatch-size 512 \
     --parallel 5 \
     --cont-batching \
     --threads 8 \
     --threads-batch 8 \
     --cache-type-k q8_0 \
     --cache-type-v q8_0 \
     --log-colors on

# The llama.cpp build is the forked one: https://github.com/PrismML-Eng/llama.cpp.git

# The server can serve 5 parallel requests, with each request capped at around `13K` tokens...

# A few benchmarks I ran:

# 1. Input: 1001 tokens, ttfs: 0.3 seconds, output: 1618 tokens at ~140t/s

# 2. Input: 9708 tokens, ttfs: 2.4 seconds, output: 2562 tokens at ~106t/s

# VRAM usage was consistently at ~7GiB.

> https://huggingface.co/prism-ml/Ternary-Bonsai-8B-gguf/resol...

sigmoid10 - 4 minutes ago

Thanks a lot, I was about to clone their llama.cpp branch and do the same.
Some more interesting tidbits from my go-to tests:
* Fails the car wash test (basic logic seems to be weak in general)
* Fails the "how many Rs in raspberry" test (not enough cross-token training data), but will funnily assume you may be talking about Indian Rupees and tell you a lot about raspberry prices in India without being asked. Possible imbalance toward Indian training data?
* Refuses to talk about Tiananmen square when pushed directly - despite being from a US company. Again, perhaps they are exposed to some censored training data? Anyways, when slowly set up along the conversation, it will tell you about the massacre. Also has no problem immediately talking about anything Gaza/Israel/US or other sensitive topics.

armanj - 6 hours ago

I did a quick benchmark & compared it with Qwen3.5: https://github.com/ArmanJR/PrismML-Bonsai-vs-Qwen3.5-Benchma...

in my results, accuracy-wise Ternary-Bonsai-8B is on par with Qwen3.5-4B. But in accuracy-per-byte, bonsai is the clear winner:

=> Ternary-Bonsai-1.7B achieved 65.1% from 462 MiB, beating Qwen3.5-0.8B by 12 points while being ~5% smaller on disk.
=> Ternary-Bonsai-4B is the accuracy-per-byte winner above 1 GiB: 83.0% from only 1.1 GiB, within 2 points of Qwen3.5-4B at 40% of the weight size.

they show strong promise on edge devices and where disk space is limited. I think this lab is worth watching.
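
To make the accuracy-per-byte comparison concrete, here is a minimal sketch that uses only the two Bonsai figures quoted above. The metric name `accuracy_per_gib` and its definition (percentage points divided by weight size) are assumptions for illustration; the linked benchmark may define its ratio differently.

```python
MIB_PER_GIB = 1024

# (accuracy in percent, weight size in GiB), as quoted in the comment above
models = {
    "Ternary-Bonsai-1.7B": (65.1, 462 / MIB_PER_GIB),
    "Ternary-Bonsai-4B": (83.0, 1.1),
}


def accuracy_per_gib(accuracy_pct, size_gib):
    """Hypothetical metric: percentage points of accuracy per GiB of weights."""
    return accuracy_pct / size_gib


scores = {name: accuracy_per_gib(*vals) for name, vals in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.1f} points/GiB")
```

A plain ratio like this always favors the smallest model, which is presumably why the comment qualifies the 4B result as the winner "above 1 GiB".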

zkmon - 40 minutes ago

The raw math: File size is hard-linked to parameter count and quant type. Intelligence is sort of linked to parameter count. Parameter count dictates the hardware requirement. What's left for the labs is compressing more intelligence into a lower parameter count, packing in more specialized intelligence, or buying up more hardware. Those are the only 3 directions all models/labs are heading.
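
The link between file size, parameter count and quant type can be sketched as a back-of-the-envelope calculation. This is an idealized estimate, assuming the file holds nothing but weights at a uniform bit width; real GGUF files also store scales, metadata and sometimes higher-precision layers, so actual sizes run larger. The helper name `weight_bytes` is made up for this example.

```python
import math


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Idealized size of the weights alone, in bytes."""
    return n_params * bits_per_weight / 8


# Information content of one value drawn from {-1, 0, +1}: log2(3) ~ 1.585 bits
TERNARY_BPW = math.log2(3)

GIB = 2**30
for label, bpw in [("fp16", 16.0), ("4-bit", 4.0), ("ternary (ideal)", TERNARY_BPW)]:
    size = weight_bytes(8e9, bpw)
    print(f"8B params @ {label:16s}: {size / GIB:6.2f} GiB")
```

The "1.58 bits" in the title is this `log2(3)` figure rounded down.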

usernametaken29 - 4 hours ago

I think it’s exciting to live in this quirky universe where we have simply accepted our hardware does weird and nonlinear stuff and that powers some math and that’s why your transform function works. Many people thought quantisation would not be viable to the extent we see, but we clearly underestimated the effect of hardware on the actual nonlinearity of the models. Cool to see this pushed to the limits.

freakynit - 4 hours ago

Nature has already set an absurdly high bar. The human brain runs on roughly 20 watts, yet delivers a level of intelligence we still can't clearly define, let alone replicate. Nothing we've built comes close... either in capability or efficiency. We're still very early in understanding what "intelligence" even means, much less engineering it. So we have a long way to go, and to push.
- sbierwagen - 3 hours ago
  
  Depending on how you convert synapse count to parameters, the brain also has something like a thousand trillion parameters. In that light it's pretty darn surprising that an artificial neural network can produce anything like coherent text.
  - sally_glance - 10 minutes ago
    
    Maybe the brain is more akin to a network of networks and the actual reasoning part is not all that large? There are lots of areas dedicated exclusively to processing input and controlling subsystems. I can imagine a future where large artificial networks work in a similar way, with multiple smaller ones connected to each other.
  - freakynit - an hour ago
    
    It indeed is. We now have models with fewer than 100M params producing pretty coherent and somewhat relevant text for a given input. That is indeed impressive.
    I believe the answer lies in how "quickly" (and how?) we are able to learn, and then generalize those learnings as well. As of now, these models need millions of examples (at least) to learn, and are still not capable of generalizing the learnings to other domains. Human brains need only a few, and then they generalize those pretty well.
- eru - 2 hours ago
  
  > Nothing we've built comes close... either in capability or efficiency.
  Only when you look at stuff that the brain is specifically good at.
  You can surpass the brain with even simple mechanical adders or an abacus in certain subdomains.
  - freakynit - 2 hours ago
    
    General intelligence I mean. What calculations even need to be performed and when, still comes from our brains.

Animats - 6 hours ago

This makes sense. The 1-bit model implies needing 2x as many neurons, because you need an extra level to invert. But the ternary model still has a sign, just really low resolution.

(I've been reading the MMLU-Redux questions for electrical engineering. They're very funny. Fifty years ago they might have been relevant. The references to the Intel 8085 date this to the mid-1970s. Moving coil meters were still a big thing back then. Ward-Leonard drives still drove some elevators and naval guns. This is supposed to be the hand-curated version of the questions. Where do they get this stuff? Old exams?)

[1] https://github.com/aryopg/mmlu-redux/blob/main/outputs/multi...

mchusma - 7 hours ago

Ever since I saw the first one of these one-bit models made by Microsoft, I thought this was a fascinating route. I assume that in practice, this is less helpful than it seems, just because there's every economic incentive in the world for the big AI labs to produce small, powerful, fast models. None of them seem to be using this technique, so it's interesting, but I suspect it's not quite working.

I also have yet to see any of these at a larger scale. For example, can you try one of these at 100 billion parameters?

yodon - 7 hours ago

So excited to see this - the big advantage of 1.58 bits is there are no multiplications at inference time, so you can run them on radically simpler and cheaper hardware.
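
A minimal sketch of why ternary weights remove the multiplications from a matrix-vector product: with every weight in {-1, 0, +1}, each term is an add, a subtract, or a skip. The function name `ternary_matvec` is made up for this example, and it deliberately omits the scale factor that ternary formats usually still carry alongside the weights.

```python
def ternary_matvec(weights, x):
    """Compute y = W @ x where every entry of W is -1, 0, or +1.

    Only additions and subtractions are used; zeros are skipped.
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing
        out.append(acc)
    return out


W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.5]
y = ternary_matvec(W, x)
print(y)
```

Other parts of inference, such as attention over activations, are not covered by this sketch.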

Animats - 6 hours ago

At 4 bits, you could just have a hard-wired table lookup. Two 4 bit values in, 256 entry table. You can have saturating arithmetic and a post-processing function for free. Somebody must be building hardware like that.
- AlotOfReading - 4 hours ago
  
  A LUT is pretty wasteful. You only have a one bit significand, so the mantissa and sign bits are boolean binops, and the exponent is a 2 bit adder.
- Taniwha - 4 hours ago
  
  and so you can at 1-bit too, and the hardware will be even smaller and cheaper
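
The 256-entry table from the top of this subthread can be sketched in software. This assumes signed 4-bit integers in two's complement, which is only one possible reading; the reply above assumes a 4-bit floating-point layout instead. The saturation range `[-32, 31]` and the names `build_mul_lut` and `lut_mul` are made up for illustration.

```python
def build_mul_lut(lo=-32, hi=31):
    """Precompute a 256-entry table for signed 4-bit x signed 4-bit products.

    The index is (a_code << 4) | b_code, where each code is the 4-bit
    two's-complement encoding. Saturation to [lo, hi] is baked into the
    table, so it costs nothing at lookup time.
    """
    def decode(code):
        return code - 16 if code >= 8 else code

    table = []
    for index in range(256):
        a, b = decode(index >> 4), decode(index & 0xF)
        table.append(max(lo, min(hi, a * b)))
    return table


def lut_mul(table, a, b):
    """Multiply two integers in [-8, 7] with a single table lookup."""
    return table[((a & 0xF) << 4) | (b & 0xF)]


LUT = build_mul_lut()
print(lut_mul(LUT, 3, -2))   # in range, so the exact product
print(lut_mul(LUT, -8, -8))  # 64 saturates to the upper bound
```

Any other post-processing function could be folded into the table the same way, since the table is built once ahead of time.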

syntex - an hour ago

hallucinates in pretty much every answer

est - an hour ago

installed since last HN post. So Bonsai (1-bit) and Ternary-Bonsai are different?

Can it be run on browsers with WASM/WebGPU?

WatchDog - 6 hours ago

All of their benchmarks are against 16 bit models right?

Why aren't they comparing to 2/3/4 bit quants?

himata4113 - 5 hours ago

looked at quant versions of these models and they all outperform it so I guess it just doesn't look as good.

ericb - 6 hours ago

This is pretty cool! I would love to see even larger models shrunk down.

If you got that into a couple gigs--what could you stuff into 20 gigs?

gbgarbeb - 4 hours ago

When do we get 1100B Kimi K2.6 in 160 GB of memory at 1.125 bpw?

wmf - 7 hours ago

Yet again they're comparing against unquantized versions of other models. They would probably still win but by a much smaller size margin.

Dumbledumb - 6 hours ago

Wouldn't the margin be higher? Moving all the other models from unquantized to quantized would lower their performance, while Bonsai stays the same. I get what you're saying if it's about score per model size, but not for absolute performance.
- SwellJoe - 5 hours ago
  
  The metric they're selling this on is intelligence per byte, rather than total intelligence. So, if they used the quantized competing models, the intelligence per byte gap shrinks, because most models hold up very well down to 6-bit quantization, and 4-bit is usually still pretty good, though intelligence definitely tends to fall below 6-bit.
  Nonetheless, the Prism Bonsai models are impressive for their size. Where it falls apart is with knowledge. It has good prose/logic for a tiny model, and it's fast even on modest hardware, but it hallucinates a lot. Which makes sense. You can't fit the world's data in a couple of gigabytes. But, as a base model for fine-tuning for use cases where size matters, it's probably a great choice.
  - happygoose - 4 hours ago
    
    unfortunately, there doesn't seem to be a clear way to fine-tune these models yet. excited for when that happens though.

goofy_lemur - 5 hours ago

> On M4 Pro, Ternary Bonsai 8B runs at 82 toks/sec, roughly 5x faster than a 16-bit 8B model

Wow, if this is true, I am extremely impressed and excited!

I wonder how much better it is on KV cache as well!

TimorousBestie - 4 hours ago

This model tends to be annoyingly literal. An example from earlier today:

>> What are some names like Llewelyn?

> Some names like Llewelyn are Llewelyn, Llewelyn, Llewelyn, (repeats several times), and Llewelyn.