Bijou64: A variable-length integer encoding

inkandswitch.com

233 points by justinweiss 21 hours ago


kstenerud - 20 hours ago

The problem is that this breaks down once you try to use SIMD instructions. I'd developed a similar kind of approach to encoding integers (and ieee774 floats) a couple of years ago (first byte encodes length and first bit of data: https://github.com/kstenerud/bonjson/blob/05b91f6fe7d6b07186... ). It was very clever and used compiler intrinsics to get the length in 1 instruction, so 2 instructions got you the final value, with no branches.

But testing proved that when you move to SIMD instructions, ULEB128 (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#ty...) or sentinel values (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#lo...) win every time because of the parallelization opportunities.

The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.

i2talics - 18 hours ago

Non-canonical encodings are actually quite useful for some applications that need variable length integers. DWARF and WASM both use LEB128.

The problem is linking: a compiler needs to emit code into independent translation units, which contain "missing" references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don't know where the location of other code is yet, we don't know how big the number representing that location is yet, which means that we don't know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references!

The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for LEB128) that way when the references are patched during linking, no code is moved around. All integers can be converted to a non-canonical 5 byte form that is "wasteful" but its a worthwhile tradeoff because it solves this issue. Other integers that don't need to be linked can be packed in a smaller var int form to save space.

wahern - 17 hours ago

This reminded me of ISO 7816-4 BER-TLV encodings, which uses the format defined in ISO/IEC 8825-1 (ASN.1 related spec). Length integer values of 0-127 are encoded in 1 byte. If the high bit is set, then the first 7 bits tell you the number of subsequent octets. So there's no offsetting involved, making it slightly less compact, but also dead simple.

EDIT: BUT, BER-TLV does permit overlong encodings. And I once found and reported a Yubikey 4 bug related to this. My source code comment for the workaround:

  -- The Yubikey 4 has an off-by-one bug which
  -- declares tag length of 255 (for the 0x53 outer
  -- tag of a certficate DO) when there are only 254
  -- bytes remaining in the reply. The reply is
  -- chained across two packets, but the off-by-one is
  -- probably related to the over-long encoded length
  -- (0x82 0x00 0xff instead of 0x81 0xff).
  --
  -- [snip packet captures]
  --
  -- Yubico's ykpiv_fetch_object function in ykpiv.c
  -- (confirmed 1.4.3-1.5.0) contains a read (memmove)
  -- overflow when the declared inner BER-TLV length
  -- (of the 0x53 tag) is longer than what was
  -- received over the wire. That makes Yubico's
  -- library oblivious to the issue. Relatedly, the
  -- set_length function has an off-by-one bug (length
  -- < 0xff instead of length <= 0xff) which produces
  -- an over-long encoded length. That doesn't by
  -- itself explain why the Yubikey 4 transmits a
  -- truncated logical reply unless the same code is
  -- being used.
stebalien - 20 hours ago

I've used LEB128 (with canonicalisation) extensively and... this looks so much nicer for most use-cases (length prefixed, supports the full uint64 range without that extra 10th byte).

The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you're using these numbers as tags/identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 <= 2 byte numbers.

[1]: https://github.com/multiformats/multicodec

jdougan - 13 minutes ago

Reminds me of the Xanadu Humber encoding.

boricj - 20 hours ago

I'm working on a C++ library at work that binds wire data and application data through token and model layers, which includes among other things a fair amount of tokenizers/composers for various formats (JSON, CBOR, BSON, CSV...).

This looks neat, but if encoding/decoding performance is important, payload size isn't and the integer is bounded, I would just put a fixed-size integer into the payload as-is.

LEB128 (and JSON for that matter) can encode integer values of arbitrary length. This doesn't, which may or may not be important but it's different.

I'll admit that I do not do any cryptographic work with my library and therefore canonical representations aren't a huge concern in my use-cases. I merely provide various configurable limits (max value length, max depth, max items per collection) in an effort to prevent infinitely long documents from hogging my tokenizers indefinitely.

omoikane - 20 hours ago

UTF-8 has the same issue ("overlong encoding") where multiple representations are possible the same code point. Someone proposed a similar tweak to remove the overlapping ranges by adjusting the base offset for byte sequences that are longer than 1. That was discussed here:

https://news.ycombinator.com/item?id=44456073 - Corrected UTF-8 (2025-07-03, 54 comments)

This "corrected UTF-8" has other problems, but I thought it's interesting how the shifted-offset idea carries over.

billpg - 19 hours ago

I forget where I encountered it, but I've seen similar encodings that eliminated the possibility of many possible encodings for the same number by making the length part of the value.

Values 0-127 are a single byte, but if that first byte has the continuation bit set, not only does that indicate the next byte has 7 more bits to contribute, it also moves the base up to the next window.

10000000 00000000 is the only way to represent 128.

10000000 10000000 00000000 is the only way to represent 16512.

Does this encoding have a name?

dzaima - 12 hours ago

Kinda surprised that there's no discussion on that this basically just does not solve the non-canonicality problem.

Forgetting to do the range check on the first_byte==255 case and just letting it do 64-bit wraparound is exactly as much of a plausible bug as missing range checks on LEB128. Any test suite with the goal of covering canonicality will trivially cover both properly; and a programmer that implements things by reading 7 words into the spec, saying "oh yeah I got this" and goes to implement what seems simple, will write a broken version of both.

Perhaps the biggest benefit is just not being associated with a format that tolerates non-canonicality in other places (though, if bijou64 gains traction, it'll only be a matter of time for wraparound-check-less versions to start appearing in places where the wraparound is fine); and I guess also it being less annoying to implement the canonicality check, though hopefully people writing security-sensitive software aren't ones to skip out on correctness checks due to annoyingness.

In a sense, bijou64 could perhaps even be more problematic - it invites not doing any range checks for the smaller inputs because they obviously don't need it, and so you can just forget to special-case the max length case; whereas LEB128 makes you already care about it at the first point it is actually LEB128.

(of course, the format does still have other benefits; enforced canonicality is just...not one of them)

arkenflame - 18 hours ago

I researched many different varint encodings for a GraphQL-specific binary format (resulting in Argo: https://github.com/msolomon/argo ). I ended up choosing protobuf-style zig-zag varints, but I also found these interesting:

vu128: https://john-millikin.com/vu128-efficient-variable-length-in...

metric/imperial varint: https://dcreager.net/2021/03/a-better-varint/

vectorizing VByte: https://arxiv.org/abs/1503.07387

conaclos - 18 hours ago

This is pretty close to SQLite's varints [0]

[0]: https://www.sqlite.org/src4/doc/1433690d7b/www/varint.wiki

willtemperley - 20 hours ago

Maybe someone can explain why an encoder would ever create the padding bytes allowed in LEB128. I contributed the parser for LEB128 in apple/swift-binary-parsing and I’m still none the wiser. I’m genuinely mystified.

aDyslecticCrow - 19 hours ago

Clever, but one thought crossed my mind;

An adveserial package can claim to have a 255 tagged integer but not actually have any followup, tricking the payload parser into an incorrect offset and reading straight off into followup memory.

It's a classic thing to check for when dealing with variable length strings or binary, but it may not cross the mind when it's hiding in the Bijou64_decode(*buff, *cr) function.

nine_k - 20 hours ago

In short: instead of a truly indefinite-length solution with a signal bit on the current byte saying whether to check the next byte, this uses a counter. Values 0x0 to 0xF7 are one-byte integers, 0xF8 to 0xFF use the upper 5 bits as a counter for the number of subsequent bytes. This limits the maximum magnitude to slightly less than 2 ^ 264 (almost all 33-byte values), which seems to be okay for practical computations. The proposed standard limits the supported size to u64 though.

The upsides: the size of the integer is apparent upon reading the first byte, and every number has exactly one canonical representation. I wish C strings had been standardized around something similar, instead on null termination.

> ...adversarial input, which is rarely in the test suite.

This made my scratch my head. My tests for quite pedestrian APIs often contain adversarial input of obvious shapes. I though that for anything security-related (like the author's project) testing against adversarial input would be be a prominent part.

harrisi - 18 hours ago

I'm surprised there's no mention in the post or here about SQLite's varint encoding. Not that it would necessarily satisfy the constraints, but it's one of the most used varint implementations.

Aardwolf - 5 hours ago

> The payload is a single contiguous big-endian integer

Why not little endian like modern CPUs?

juancn - 14 hours ago

I like the denormalization of VLE ints (with or without zig-zag encoding of negatives), it helps support out of band information, such as nulls and other signals in serialization protocols with minimal overhead.

For example you can use a denormalized zero to signal null.

You can still define a canonical encoding where denormalizations have specific meaning or signal an error.

michaelmure - 19 hours ago

One nice upside of having a single way to encode a value is fuzzing: when you work on an encoder/decoder, you can use a fuzzer and do round-trip comparison until you find crashes or inputs/outputs that don't match (and therefore issues in the code). But with LEB128 for example, the fuzzer quickly learn about those alternatives encoding and there is not much you can do from there.

HansHamster - 20 hours ago

It feels a bit unfair to say that it is faster by being able to tell the total length from the first byte and capping it at 64 bit, while some of the other formats can store arbitrarily large integers. I guess you could use another variable length encoding for the prefix at the cost of some performance and using even more space...

yread - 19 hours ago

Wouldn't something like this also work: https://en.wikipedia.org/wiki/Elias_omega_coding I've used to great effect for compression

MarkusQ - 17 hours ago

> This causes problems for signed data

Given that the context up to this point had been representation of integers, I initially trip on this. :)

apitman - 16 hours ago

This reminds me of the varint encoding used by QUIC, but I've never implemented it. Anyone know the differences?

cantalopes - 20 hours ago

I love the random hyperlink underlines on that page

TZubiri - 4 hours ago

>The check is forgotten, optimised away, or never ported. The protocol’s security property silently degrades. This is the bug class bijou64 is designed to make impossible. Not by adding more checks, but by removing the one that mattered — and making the format such that, with no canonicality check at all, the only encoding that exists for any given value is the canonical one

Here's two passing tests for software craftmanship:

1) it looks decades into the past 2) it looks decades into the future

amluto - 19 hours ago

Just a quick reminder:

> This causes problems for signed data if you ever want to do things like compression since you need to know the exact bytes that were signed.

If you are verifying a signature by taking some logical data structure, turning it into a byte string, and calling the verification primitive on those bytes, you likely have a design error. You should instead collect bytes, verify the signature, and then parse the bytes after verifying the signature. And remember to include enough context in those bytes so a different message signed for a different purpose by the same key doesn’t confuse you.

dekdrop - 13 hours ago

How do I get to this page from the home page?

alexpandey - 8 hours ago

[dead]

alex-reyss - 19 hours ago

[dead]

RedShift1 - 20 hours ago

This seems quite convoluted just to avoid the "0 can be represented in more than one way" problem.