MessagePack: It's like JSON, but fast and small.
msgpack.org
94 points by davikr 2 days ago
I played quite a bit with MessagePack, used it for various things, and I don't like it. My primary gripes are:
+ Objects and Arrays need to be entirely and deeply parsed. You cannot skip over them.
+ Objects and Arrays cannot be streamed when writing. They require a 'count' at the beginning, and since the size of the 'count' itself varies in bytes, you can't even "walk back" and update it. It would have been MUCH, MUCH better to have "begin" and "end" tags, err, pretty much like JSON has, really (sketched below).
You can alleviate these problems by using extensions, storing a byte count to skip, etc., but really, if you start down that road, you might as well use another format altogether.
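To make the streaming gripe concrete, a minimal sketch with the Python msgpack package (an assumption on my part; other implementations expose different APIs): the array header takes the element count up front, so an unknown-length source has to be buffered just to learn the count.

    import msgpack

    items = (i * i for i in range(1000))      # a generator: length unknown until exhausted

    packer = msgpack.Packer()
    buffered = list(items)                    # forced to buffer only to learn the count
    out = packer.pack_array_header(len(buffered))
    for item in buffered:
        out += packer.pack(item)
    # With JSON you could emit '[', stream the elements, then ']' without ever buffering.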
Also, from my tests, it is not particularly more compact unless, again, you spend some time adding a hash table for keys and embedding that; but at the point where that becomes valuable, you might as well gzip the JSON!
So in the end, in my experience, it is a lot better to use some sort of 'extended' JSON format with the idiocies removed (the trailing-comma ban, forcing double quotes for keys, etc.).
> trailing commas, forcing double-quote for keys etc
How do these things matter in any use case where a binary protocol might be a viable alternative? These specific issues are problems for human-readability and -writability, right? But if msgpack was a viable technology for a particular use case, those concerns must already not exist.
I think this is the point: when people wanted "easy" parsing and readability for humans, they abandoned binary protocols for JSON; now, people running into performance issues they don't like are starting over and re-learning all the past lessons of why and how binary protocols were used in the first place.
The cost will be high, just like the cost of having to relearn CS basics for non-trivial JS use was/is.
> Object and Array cannot be streamed when writing. They require a 'count' at the beginning
Most languages know exactly how many elements a collection has (to say nothing of the number of members in a struct).
Not if you're streaming input data where you cannot know the size ahead of time, and you want to pipeline the processing so that output is written in lockstep with the input. It might not be the entire dataset that's streamed.
For example, consider serializing something like [fetch(url1), join(fetch(url2), fetch(url3))]. The outer count is knowable, but the inner isn't. Even if the size of fetch(url2) and fetch(url3) are known, evaluating a join function may produce an unknown number of matches in its (streaming) output.
JSON, Protobuf, etc. can be very efficiently streamed, but it sounds like MessagePack is not designed for this. So processing the above would require pre-rendering the data in memory and then serializing it, which may require too much memory.
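A sketch of the pipelined case (fetch_matches is a hypothetical streaming source): with JSON the array is simply closed after the last result arrives, with no count needed up front.

    import json
    import sys

    def fetch_matches():
        # stand-in for a join over streaming fetch() results
        yield {"url": "https://example.com/2", "status": 200}
        yield {"url": "https://example.com/3", "status": 404}

    sys.stdout.write("[")
    for i, match in enumerate(fetch_matches()):
        sys.stdout.write(("," if i else "") + json.dumps(match))   # written as each result arrives
    sys.stdout.write("]\n")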
> JSON, Protobuf, etc. can be very efficiently streamed
Protobuf yes, JSON no: you can't properly deserialize a JSON collection until it is fully consumed. The same issue you're highlighting for serializing MessagePack occurs when deserializing JSON. I think MessagePack is very much written with streaming in mind. It makes sense to trade write-efficiency for read-efficiency, especially as the entity primarily affected by the tradeoff is the one making the cut, in the case of msgpack. It all depends on your workloads, but I've done benchmarks for past work where msgpack came out on top. It can often be a good fit when you need to do stuff in Redis.
(If anyone thinks to counter with JSONL, well, there's no reason you can't do the same with msgpack).
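For illustration, a hedged sketch of that JSONL-style usage with msgpack-python (assumed library): top-level messages are simply concatenated, and a streaming Unpacker decodes them as they arrive.

    import msgpack

    stream = b"".join(msgpack.packb({"seq": i}) for i in range(3))

    unpacker = msgpack.Unpacker(raw=False)
    unpacker.feed(stream)            # could equally be fed from a socket, chunk by chunk
    for message in unpacker:
        print(message)               # {'seq': 0}, {'seq': 1}, {'seq': 2}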
The advantage of JSON for streaming is on serialization. A server can begin streaming the response to the client before the length of the data is known.
JSON Lines is particularly helpful for JavaScript clients where streaming JSON parsers tend to be much slower than JSON.parse.
Sorry, I was mostly thinking of writing. With JSON the main problem is, as you say, read efficiency.
I think the pattern in question might be (for example) the way some people (like me) sometimes write JSON as a trace of execution, sometimes directly to stdout (so, no going back in the stream). You're not serializing a structure but directly writing it as you go. So you don't know in advance how many objects you'll have in an array.
If your code is compartmentalized properly, a lower layer (sub-objects) doesn't have to do all kinds of preparation just because a higher layer has "special needs".
For example, pseudo code in a sub-function:
    if (that) write_field('that');
    if (these) write_field('these');
With MessagePack you have to apply the logic once to count, then again to write, and keep state for each level, etc.
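Roughly, that two-pass pattern looks like this in Python with msgpack-python (assumed; write_fields is just an illustrative stand-in for the sub-function above):

    import msgpack

    def write_fields(packer, that=None, these=None):
        present = {k: v for k, v in (("that", that), ("these", these)) if v is not None}
        out = packer.pack_map_header(len(present))    # the count has to come first
        for key, value in present.items():
            out += packer.pack(key) + packer.pack(value)
        return out

    print(write_fields(msgpack.Packer(), that=1))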
That sounds less like compartmentalisation and more like you having special needs and being unhappy they are not catered to.
Although MessagePack is definitely not a drop-in replacement for JSON, it is certainly extremely useful.
Unlike JSON, you can’t just open a MessagePack file in Notepad or vim and have it make sense. It’s often not human readable. So using MessagePack to store config files probably isn’t a good idea if you or your users will ever need to read them for debugging purposes.
But as a format for something like IPC or high-performance, low-latency communication in general, MessagePack brings serious improvements over JSON.
I recently had to build an inference server that needed to be able to communicate with an API server with minimal latency.
I started with gRPC and protobuf since that's what everyone recommends, but after a lot of benchmarking, the fastest approach I found was serving MessagePack over HTTP with a Litestar Python server (it's much faster than FastAPI), using msgspec for very fast MessagePack encoding and ormsgpack for very fast decoding.
Not sure how this beat protobuf and gRPC but it did. Perhaps the Python implementation is just slow. It was still faster than JSON over HTTP, however.
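For anyone curious, a minimal sketch of the msgspec piece only (the Litestar wiring is omitted, and InferenceRequest with its fields is illustrative rather than the actual schema):

    import msgspec

    class InferenceRequest(msgspec.Struct):
        model: str
        inputs: list[float]

    payload = msgspec.msgpack.encode(InferenceRequest(model="demo", inputs=[0.1, 0.2]))
    request = msgspec.msgpack.decode(payload, type=InferenceRequest)   # validated on decode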
Makes me wish Cap'n Proto was more prevalent and developer friendly. It ticks quite a few of my boxes.
CBOR: it's like JSON but fast and small, and also an official IETF standard.
Disclaimer: I wrote and maintain a MessagePack implementation.
CBOR is MessagePack. The story is that Carsten Bormann wanted to create an IETF-standardized version of MP; the creators asked him not to (after he had acted in pretty bad faith), so he forked off a version, added some very ill-advised tweaks, named it after himself, and submitted it anyway.
I wrote this up years ago (https://news.ycombinator.com/item?id=14072598), and since then the only thing they've addressed is undefined behavior when a decoder encounters an unknown simple value.
CBOR is basically a fork of MsgPack. I prefer the original - it’s simpler and there are more high-quality implementations available.
CBOR is actively used in the WebAuthn spec (passkeys), so browsers ship with an implementation... And if you intend to support it, even via a library, you will be shipping an implementation as well.
https://www.w3.org/TR/webauthn-2/#sctn-conforming-all-classe...
Disclaimer: I wrote and maintain a MessagePack implementation.
Reading through this, it looks like they toss out indefinite length values, "canonicalization", and tags, making it essentially MP (MP does have extension types, I should say).
https://fidoalliance.org/specs/fido-v2.0-ps-20190130/fido-cl...
Which Web API can encode and decode CBOR? I'm not aware of any, and unless I'm mistaken you will need to ship your own implementation in any case.
It's too complex, and the implementation is poor[0].
[0]: https://github.com/getml/reflect-cpp/tree/main/benchmarks
CBOR is a standard, not an implementation.
As a standard it's almost exactly the same as MsgPack, the difference is mostly just that CBOR filled out underspecified parts of MsgPack. (Things like how extensions for custom types work, etc.)
Implementation is poor because of performance?
Performance is just one aspect, and using "poor" to describe it is very misleading. Say "not performant" if that is what you meant.
MessagePack saves a little bit of space and CPU ... but not a lot:
https://media.licdn.com/dms/image/v2/D5612AQF-nFt1cYZhKg/art...
Source: https://www.linkedin.com/pulse/json-vs-messagepack-battle-da...
An approximate 20% reduction in bandwidth looks significant to me. I think the problem here is that the chart uses a linear scale instead of a logarithmic scale.
Looking at the data, I'm inclined to agree that not much CPU is saved, but the point of MessagePack is to save bandwidth, and it seems to be doing a good job at that.
> An approximate 20% reduction in bandwidth looks significant to me.
Significant with regard to what? Not doing anything? Flipping the toggle to compress the response?
> An approximate 20% reduction in bandwidth looks significant to me.
To me it doesn't. There's compression for much bigger gains. Or, you know, just send less data?
I've worked at a place where our backend regularly sent humongous JSON blobs to all the connected clients. We were all pretty sure this could be reduced by 95%. But who would try to do that? There wasn't a business case. If someone tried and succeeded, no one would notice. If someone tried and broke something, it'd look bad. So, status quo...
Compression is a performance killer for intra-DC communication. You typically avoid it at all costs when doing RPC within the same AZ/DC.
Thus, the only thing you can do after that to improve performance is to reduce bytes on the wire.
Okay so you just use plain JSON again...
In a discussion about using messagepack that doesn't really sound like messagepack is winning.
In a system that requires the absolute highest throughput, compression is usually the most expensive step in the parse chain, so being able to parse without decompressing first is valuable.
I've tried MessagePack a few times, but to be honest the debugging hassle was never really worth it.
That's a bad benchmark; it doesn't show what type of data is being encoded or the programming language.
Encoding/decoding an array of strings in JavaScript is going to have a completely different performance profile than encoding/decoding an array of floats in a lower-level language like C.
I came here to say this. I did my own benchmarks on JSON versus msgpack. I shall have to reproduce and publish them one day.
If you have a lot of strings, or lots of objects where the total data in keys is similar to the total data in values, then msgpack doesn't help much.
But when you have arrays of floats (which some systems at my work have a lot of), and if you want to add a simple extension to make msgpack understand e.g. JavaScript's TypedArray family, you can get some very large speedups without much work.
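Not the actual extension used above, but a hedged sketch of the idea with msgpack-python: pack a float32 array as a single extension payload instead of element by element (the extension code 0x10 is arbitrary).

    from array import array
    import msgpack

    FLOAT32_ARRAY_EXT = 0x10    # arbitrary code chosen for this sketch

    def default(obj):
        # array.array is not natively serializable, so msgpack hands it to us here
        if isinstance(obj, array) and obj.typecode == "f":
            return msgpack.ExtType(FLOAT32_ARRAY_EXT, obj.tobytes())
        raise TypeError(f"cannot serialize {type(obj)}")

    def ext_hook(code, data):
        if code == FLOAT32_ARRAY_EXT:
            floats = array("f")
            floats.frombytes(data)
            return floats
        return msgpack.ExtType(code, data)

    packed = msgpack.packb({"samples": array("f", [1.0, 2.5, 3.25])}, default=default)
    print(msgpack.unpackb(packed, ext_hook=ext_hook, raw=False))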
CBOR is derived from MessagePack, and is used throughout BlueSky/AtProto.
Reports there are that JSON is still the speed champ, by a healthy margin. I am among many who chalk this up to JSON encoding/decoding being fantastically well optimized, with many heavily invested, well-tuned libraries available.
(Ed: well, https://github.com/djkoloski/rust_serialization_benchmark only shows Rust options, but it somewhat contradicts this: JSON encoding appears ~3x slower when comparing the best CBOR and JSON performances.)
This article feels quite broad and high-level, a general introduction with a couple of graphs thrown in at the end to try to close the deal. The space savings, I tend to think, are probably reasonably representative regardless of the library being tested, but the speed is going to vary greatly, and this is one of very few examples I've seen where CBOR comes out faster. The article does not provide any information I can see about what the test is or which data and libraries are being tested.
It is worth noting that interest in CBOR has risen very recently, as Kubernetes 1.32 shipped an alpha feature for speaking CBOR. The very good library below has gotten some good attention: https://github.com/fxamacker/cbor
In my experience protobuf was smaller than MessagePack. I even tried compressing both with zstd and protobuf was still smaller. On the other hand protobuf is a lot less flexible.
MessagePack is self-describing (it contains tags like "the next bytes are an integer"), but Protobuf relies on an external schema.
You can decode protobuf in a similar way; I've written several decoders in the past that don't rely on an external schema. There are some types you can't always decode with 100% confidence, but then again, JSON or something like it isn't strongly typed either.
Protobuf doesn't have key names without the schema, right?
Wouldn't it just decode to (1, type, value), (2, type, value) without the schema, with no names?
Human-readable key names are a big part of what makes a self-describing format useful, but they also contribute to bloat; a format with an embedded schema in the header would help.
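As a toy illustration of what schemaless protobuf decoding yields (field numbers and wire types, but no names); this sketch handles only varint and length-delimited fields:

    def read_varint(buf, i):
        # decode a base-128 varint starting at offset i, return (value, next offset)
        shift = result = 0
        while True:
            b = buf[i]; i += 1
            result |= (b & 0x7F) << shift
            if not b & 0x80:
                return result, i
            shift += 7

    def decode_fields(buf):
        i = 0
        while i < len(buf):
            key, i = read_varint(buf, i)
            field, wire = key >> 3, key & 0x07
            if wire == 0:                          # varint
                value, i = read_varint(buf, i)
            elif wire == 2:                        # length-delimited (bytes/str/submessage)
                length, i = read_varint(buf, i)
                value, i = buf[i:i + length], i + length
            else:
                raise ValueError(f"wire type {wire} not handled in this sketch")
            yield field, wire, value

    # the classic example: field 1, varint, value 150 -> [(1, 0, 150)]
    print(list(decode_fields(b"\x08\x96\x01")))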
ZigPack is a pretty good MessagePack implementation written in Zig: https://github.com/thislight/zigpak
Does it solve the problem of the repeated set of keys in an array of objects, e.g. when representing a table?
I don't think using a dictionary of key values is the way to go here. I think there should be a dedicated "table" type, where the column keys are only defined once, and not repeated for every single row.
MessagePack can encode rows as well; you just need to manage linking the keys back up during deserialization. In fact, it can encode arbitrary binary data without needing base64 the way JSON does.
You can just use arrays of arrays, like most scientific applications do.
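A hedged sketch of that convention with msgpack-python (assumed): columns stored once, rows as plain arrays; nothing in the format enforces this shape, the reader and writer just agree on it.

    import msgpack

    columns = ["id", "name", "score"]
    rows = [[1, "ada", 9.5], [2, "bob", 7.25]]

    packed = msgpack.packb({"columns": columns, "rows": rows})

    table = msgpack.unpackb(packed, raw=False)
    records = [dict(zip(table["columns"], row)) for row in table["rows"]]   # rebuild objects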
I can do that with JSON too. I was hoping MessagePack would have built-in functionality to do it.
I discovered JSON BinPack recently, which works either schemaless (like msgpack) or, supposedly more efficiently, with a schema. I haven't tried the codebase yet, but it looks interesting.
It drops the most useful aspect of JSON, which is that you can open it in a text editor.
It's like JSON in that it's a serialisation format.
And here I was thinking serialization was the most useful aspect of json (and it's not even great at that)
Serialization is what it is, not why it is useful. The most useful aspect of JSON is undoubtedly wide support (historically, that every web browser carried a parser implementation from day 1; contemporarily, the availability as a library in nearly every programming language, often in std, plus shell accessibility through jq).
> Serialization is what it is, not why it is useful.
Is there a non-teleological manner in which to evaluate standards?
> The most useful aspect of JSON is undoubtedly wide support
This is a fantastic example of how widespread technology doesn't imply quality.
Don't get me wrong, I love JSON. It's a useful format with many implementations of varying quality. But it's also a major pain in the ass to deal with: encoding errors, syntax errors, no byte syntax, schemas are horribly implemented. It's used because it's popular, not because it has some particular benefit.
In fact, I'd argue JSON's largest benefit as opposed to competitive serializers has been to not give a fuck about the quality of (de)serialization. Who gives a fuck about the semantics of parsing a number when that's your problem?!?
Ignorant question: is the relatively small size benefit worth another standard that's fairly opaque to troubleshooting and loses readability?
Is there a direct comparison of why someone should choose this over alternatives? 27 bytes down to 18 bytes (for their example) just doesn't seem like enough of a benefit. This clearly isn't targeted to me in either case, but for someone without much knowledge of the space, it seems like a solution in search of a problem.
If you need a format that can transport byte arrays unmodified (image data, etc), msgpack (or protos or whatever) is much better than JSON since you don't have to base64 encode or escape the data. It also supports non-string keys which can be convenient.
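A quick comparison sketch (msgpack-python assumed): raw bytes ride along as a bin value, while JSON needs a base64 round trip.

    import base64
    import json
    import msgpack

    image = bytes(range(256))                      # stand-in for real binary data

    mp = msgpack.packb({"image": image})
    js = json.dumps({"image": base64.b64encode(image).decode("ascii")})

    assert msgpack.unpackb(mp, raw=False)["image"] == image
    print(len(mp), len(js))                        # the msgpack payload is noticeably smaller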
From a cursory reading of the specification, the format doesn't seem to offer anything groundbreaking: no particular benefits compared to other similar formats.
Whatever your messaging format is going to be, the performance will mostly depend on the application developer and their understanding of the specifics of the format. So, the 20% figure seems arbitrary.
In practical terms, I'd say: if you feel confident about dealing with binary formats and like fiddling with this side of your application, probably, making your own is the best way to go. If you don't like or don't know how to do that, then, probably, choosing the one that has the most mature and robust tools around it is the best option.
----
NB. It's also useful to remember that data transfer over the network is discrete, with the minimum chunk of information being one MTU. So, for example, if most of the messages exchanged by the application were already smaller than one MTU before attempting to optimize for size, making them shorter will yield no tangible benefit. It's really only worth starting to think about size optimizations when a significant portion of the messages measure at least in the low double digits of MTUs, if we believe the 20% figure.
It's a similar situation with storage, which is also discrete, with the minimum chunk being one block; similar reasoning applies there as well.
It's useful when dealing with high-traffic networked services; the little savings here and there compound over time and save you a lot of bandwidth.
I'd argue the value goes up with larger payloads. The tradeoff is ease of use vs efficiency.
I'm not sure why you wouldn't just develop in JSON, then flip a switch to use binary.
It starts to make sense if you are returning a large array of objects and each object contains several long values. Unfortunately it looks like msgpack doesn't support u128 or arbitrary precision big integers. I suppose you can always cast to byte[].
Shouts to msgspec; I haven't had a project without it in a while.
In our C++ project, we use the nlohmann library for handling JSON data. When sending JSON over MQTT, we leverage a built-in function of nlohmann to serialize our JSON objects into the MessagePack format: you can simply call nlohmann::json::to_msgpack(jsonObj). Similarly, from_msgpack() decodes the data back just as easily.
A trick I often use to get the most out of messagepack is using array encoding of structs. Most msgp libraries have support for this.
This gives you a key benefit of protobuf, without needing external schema files: you don't have to pay for the space of the keys in your data.
This is simply not something you can do with JSON, and depending on your data it can yield substantial space savings.
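A hedged sketch of the array-encoding idea using msgspec, one library that supports it via array_like=True; the field names are dropped from the wire and recovered from the struct definition on decode.

    import msgspec

    class Point(msgspec.Struct, array_like=True):
        x: float
        y: float
        label: str

    wire = msgspec.msgpack.encode(Point(1.0, 2.0, "origin"))   # goes out as [1.0, 2.0, "origin"]
    point = msgspec.msgpack.decode(wire, type=Point)           # names restored from the struct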
I've built a few systems using msgpack-rpc; it serves really well as a transport format in my experience!
Serialization vulnerabilities, anyone?
I like how it makes binary easy. I wish there were a simple way to do binary in JSON.
Made an open image format with this for constrained networks, and it works great.