From XML to JSON to CBOR

85 points by GarethX on 7/30/2025, 10:31:12 AM | cborbook.com ↗ | 97 comments

Comments (97)

mrbluecoat · 1d ago
Feels like a CBOR ad to me. I agree that most techs are familiar with XML and JSON, but calling CBOR a "pivotal data format" is a stretch compared to Protobuf, Parquet, Avro, Cap'n Proto, and many others: https://en.m.wikipedia.org/wiki/Comparison_of_data-serializa...
ognyankulev · 1d ago
The fact that the long article fails to make the historical/continuation link to MessagePack is by itself a red flag signalling a CBOR ad.

Edit: OK, actually there is a separate page for alternatives: https://cborbook.com/introduction/cbor_vs_the_other_guys.htm...

mikepurvis · 1d ago
Notably missing is a comparison to Cap'n Proto, which to me feels like the best set of tradeoffs for most binary interchange needs.

I honestly wonder sometimes if it's held back by the name— I love the campiness of it, but I feel like it could be a barrier to being taken seriously in some environments.

aidenn0 · 1d ago
Doesn't Cap'n Proto require the receiver to know the types for proper decoding? This wouldn't entirely disqualify it from comparison, since e.g. protobufs are that way as well, but it does make it less interesting for comparison with CBOR, which is type-tagged.
e_y_ · 1d ago
There's quite a few formats that are self-describing already, so having a format that can skip the type and key tagging for that extra little bit of compactness and decoding efficiency is a unique selling point.

There's also nothing stopping you from serializing unstructured data using an array of key/value structs, with a union for the value to allow for different value types (int/float/string/object/etc), although it probably wouldn't be as efficient as something like CBOR for that purpose. It could make sense if most of the data is well-defined but you want to add additional properties/metadata.

Many languages take unstructured data like JSON and parse them into a strongly-typed class (throwing validation errors if it doesn't map correctly) anyways, so having a predefined schema is not entirely a bad thing. It does make you think a bit harder about backwards-compatibility and versioning. It also probably works better when you own the code for both the sender and receiver, rather than for a format that anyone can use.

Finally, maybe not a practical thing and something that I've never seen used in practice: in theory you could send a copy of the schema definition as a preamble to the data. If you're sending 10000 records and they all have the same fields in the same order, why waste bits/bytes tagging the key name and type for every record, when you could send a header describing the struct layout. Or if it's a large schema, you could request it from the server on demand, using an id/version/hash to check if you already have it.

In practice though, 1) you probably need to map the unknown/foreign schema into your own objects anyways, and 2) most people would just zlib compress the stream to get rid of repeated key names and call it a day. But the optimizer in me says why burn all those CPU cycles decompressing and decoding the same field names over and over. CBOR could have easily added optional support for a dictionary of key strings to the header, for applications where the keys are known ahead of time, for example. (My guess is that they didn't because it would be harder for extremely-resource-constrained microcontrollers to implement).
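A sketch of that key-dictionary idea (hypothetical framing, not part of CBOR or any spec — shown with JSON for readability): send the key table once, then each record as a list of values in key-table order.

```python
import json

records = [{"name": "dog food", "rating": 5},
           {"name": "cat food", "rating": 3},
           {"name": "dog food", "rating": 4}]

# Hypothetical "header dictionary" framing: keys are sent once up front,
# then each record is just a row of values in key-table order.
keys = sorted({k for r in records for k in r})
packed = {"keys": keys, "rows": [[r[k] for k in keys] for r in records]}

plain = json.dumps(records).encode()
dictionaried = json.dumps(packed).encode()
print(len(plain), len(dictionaried))  # the dictionaried form is smaller
```

The receiver rebuilds records with `dict(zip(packed["keys"], row))` — no schema negotiation needed beyond the header itself.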

kentonv · 1d ago
> I feel like it could be a barrier to being taken seriously in some environments.

Working as intended. ;)

kentonv · 1d ago
In all seriousness: I develop Cap'n Proto to serve my own projects that use it, such as the Cloudflare Workers runtime. It is actually not my goal to see Cap'n Proto itself adopted widely. I mean, don't get me wrong, it'd be cool, but it isn't really a net benefit to me personally: maybe people will contribute useful features, but mostly they will probably just want me to review their PRs adding features I don't care about, or worse, demand I fix things I don't care about, and that's just unpaid labor for me. So mostly I'm happy with them not.

It's entirely possible that this will change in the future: like maybe we'll decide that Cap'n Proto should be a public-facing part of the Cloudflare Workers platform (rather than just an implementation detail, as it is today), in which case adoption would then benefit Workers and thus me. At the moment, though, that's not the plan.

In any case, if there's some company that fancies themselves Too Serious to use a technology with such a silly name and web site, I am perfectly happy for them to not use it! :)

mikepurvis · 1d ago
Ha, I wondered if you'd comment. Yeah that's a reasonable take.

I think for me it would be less about locking out stodgy, boring companies, and perhaps instead it being an issue for emerging platforms that are themselves concerned with the optics of being "taken seriously". I'm specifically in the robotics space, and over the past ten years ROS has been basically rewritten to be based around DDS, and I know during the evaluation process for that there were prototypes kicked around that would have been more webtech type stuff, things like 0mq, protobufs, etc. In the end the decision for DDS was made on technical merits, but I still suspect that it being a thing that had preexisting traction in aerospace and especially NASA influenced that.

stronglikedan · 1d ago
man I love HN lol
abrookewood · 1d ago
Have to agree. I've heard of every format you mentioned, but never heard of CBOR.
pelagicAustral · 1d ago
I first heard of it while developing a QR code travel passport during the Covid era... the technical specification included CBOR as part of the implementation requirement. Past this, I have not crossed paths with it again...
f_devd · 1d ago
I would agree their claim is a bit early, but I think a key difference between those you mentioned and CBOR is the stability expectation. Protobuf/Parquet/etc are usually single-source libraries/frameworks, which can be changed quite quickly, while CBOR seems to be going for a spec-first approach.
darthrupert · 1d ago
CBOR is just a standard data format. Why would it need an ad? What are they selling here?
Retr0id · 1d ago
A lot of people (myself included) are working on tools and protocols that interoperate via CBOR. Nobody is selling CBOR itself, but I for one have a vested interest in promoting CBOR adoption (which makes it sound nefarious but in reality I just think it's a neat format, when you add canonicalization).

CBOR isn't special here, similar incentives could apply to just about any format - but JSON for example is already so ubiquitous that nobody needs to promote it.

8n4vidtmkvmk · 1d ago
If I adopt a technology, I probably don't want it to die out. Widespread support is generally good for all that use it.
_the_inflator · 1d ago
Love or hate JSON, the beauty and utility stem from the fact that you have only the fundamental datatypes as a requirement, and that's it.

Structured data that, by nesting, pleases the human eye, reduced to the max in a key-value fashion, pure minimalism.

And while you have to write type converters all the time for datetime, BLOBs etc., these converters are the real reasons why JSON is so useful: every OS or framework provides the heavy lifting for it.

So any elaborate new silver bullet would require solving the converter/mapper problem, which it can't.

And you can complain or explain with JSON: "Comments not a feature?! WTF!" - Add a field with the key "comment"

Some smart guys went the extra mile and nevertheless demanded more, because wouldn't it be nice to have some sort of "strict JSON"? JSON schema was born.

And here you can visibly experience the inner conflict of "on the one hand" vs "on the other hand". Applying schemas to JSON is a good cause and reasonable, but guess what happens to JSON? It looks like unreadable bloat, which means XML.

Extensibility is fine, basic operations appeal to both demands, simple and sophisticated, and don't impose the sophistication on you just for a simple 3-field exchange about dog food preferences.

sevensor · 1d ago
My complaint about JSON is that it’s not minimal enough. The receiver always has to validate anyway, so what has syntax typing done for us? Different implementations of JSON disagree about what constitutes a valid value. For instance, is

    {"x": NaN}
valid JSON? How about 9007199254740993? Or -.053? If so, will that text round trip through your JSON library without loss of precision? Is that desirable if it does?

Basically I think formats with syntax typed primitives always run into this problem: even if the encoder and decoder are consistent with each other about what the values are, the receiver still has to decide whether it can use the result. This after all is the main benefit of a library like Pydantic. But if we’re doing all this work to make sure the object is correct, we know what the value types are supposed to be on the receiving end, so why are we making a needlessly complex decoder guess for us?

aidenn0 · 1d ago
NaN is not a valid value in JSON. Neither are 0123 or .123 (there must always be at least one digit before the decimal marker, but extraneous leading zeroes are disallowed).

JSON was originally parsed in javascript with eval() which allowed many things that aren't JSON through, but that doesn't make JSON more complex.

sevensor · 1d ago
That’s my point, though! I’ve run into popular JSON libraries that will emit all of those! 9007199254740993 is problematic because it’s not representable as a 64 bit float. Python’s JSON library is happy to write it, even though you need an int to represent it, and JSON doesn’t have ints.

Edit: I didn’t see my thought all the way through here. Syntax typing invites this kind of nonconformity, because different programming languages mean different things by “number,” “string,” “date,” or even “null.” They will bend the format to match their own semantics, resulting in incompatibility.
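Python's standard-library json module shows both problems (a sketch; default CPython behavior):

```python
import json

# Python happily emits the non-standard token NaN unless you opt out
# with allow_nan=False (which raises ValueError instead):
print(json.dumps(float("nan")))  # NaN  -- invalid per RFC 8259

# 2**53 + 1 round-trips fine as a Python int...
print(json.loads(json.dumps(9007199254740993)))  # 9007199254740993

# ...but any decoder that parses numbers as 64-bit floats loses it:
print(float(9007199254740993) == 9007199254740992)  # True
```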

aidenn0 · 15h ago
Before your edit, I was going to object to your premise because it seems like a format could get worse just by more implementations being made.

After your edit, I see that it's rather that syntax-typed formats are prone to this form of implementation divergence.

I don't think this is limited to syntax-typed formats though. For example, TNetstrings[1] have type tags, but "#" is an integer. The specification requires that integers fit into 63 bits (since the reference encoder will refuse to encode a python long), but implementations in C tend to allow 64 bits and in other languages allow bignums. It does explicitly allow "nan", "inf", and "-inf" FWIW.

1: https://tnetstrings.info/

sevensor · 13h ago
Agreed; I think there’s a problem with self-describing data as a concept. It just begs for implementation defined weirdness.
dragonwriter · 1d ago
> 9007199254740993 is problematic because it’s not representable as a 64 bit float. Python’s JSON library is happy to write it, even though you need an int to represent it

JSON numbers have unlimited range in terms of the format standard, but implementations are explicitly permitted to set limits on the range and precision they generate and handle, and users are warned that:

   [...] Since software that implements IEEE 754 binary64 (double precision)
   numbers is generally available and widely used, good interoperability can be 
   achieved by implementations that expect no more precision or range than these
   provide, in the sense that implementations will approximate JSON
   numbers within the expected precision.
Also, you don't need an int to represent it (a wide enough int will represent it, so will unlimited precision decimals, wide enough binary floats -- of standard formats, IEEE 754 binary128 works -- etc.).
sevensor · 17h ago
RFC 8259 is a good read and I wish more people would make the effort. I really don’t mean to bash JSON here. It was a great idea and it continues to be a great idea, especially if you are using javascript. However, the passage you quote illustrates the same shortcoming I’m complaining about: RFC 8259 basically says “valid primitive types in json are the valid primitive types in your programming language,” but this results in implementations like Python’s json library emitting invalid tokens like bare NaN, which can cause decoders to choke.

I think what JSON gets right is that it gives us a universal way of expressing structure: arrays and objects map onto basic notions of sequence and association that are useful in many contexts and can be represented in a variety of ways by programming languages. My ideal data interchange format would stop there and let the user decide what to do with the value text after the structure has been decoded.

conartist6 · 16h ago
Yeah I would emit NaN and just hope the receiver handles it.

What's the point of lying about the data?

The format offers you no data type that would not be an outright lie when applied to this data, so you may as well not lie and break the format

sevensor · 13h ago
Your other option is to comply with the spec, emit a string, and expect the receiver to deal with that at validation time rather than parse time.
zzo38computer · 6h ago
> you have only the fundamental datatypes as a requirement

Not really; the set of datatypes has problems. It uses Unicode, not binary data and not non-Unicode text. Numbers are usually interpreted as floating point numbers rather than integers, which can also be a problem. Keys can only be strings. And there are other problems. So, the data types are not very good.

And, since it is a text format, it means that escaping is required.

> And while you have to write type converters all the time for datetime, BLOBs etc.

Not having a proper data type for binary means that you will need to encode it using different types, which negates the benefit of JSON anyway. So, I think JSON is not as helpful.

I think DER is better (you do not have to use all of the types; only the types that you use need to be implemented, because the format of DER makes it possible to skip anything that you do not care about), and I made up TER, a text-based format which can be converted to DER (so, even though binary data is represented as text, it still represents the binary data type, rather than needing to use the wrong data type like JSON does).

> And you can complain or explain with JSON: "Comments not a feature?! WTF!" - Add a field with the key "comment"

But then it is a part of the data, which you might not want.

elcritch · 11h ago
CBOR (and MsgPack) still embraces that simplicity. It provides the same kinds of key-value maps, lists, and basic values.

However, the types are more precise, allowing you to differentiate between int32s and int64s, or between strings and bytes.

Essentially you can replace JSON with it and gain performance and less ambiguity, with the same flexibility. You do need a step to print CBOR in human-readable form, but it has a standardized human-readable form similar to a typed JSON.
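To give a feel for how simple the framing is, here is a minimal (and deliberately incomplete) encoder following RFC 8949's head-byte scheme — 3-bit major type plus 5-bit length — handling only small non-negative ints, byte strings, text strings, and maps:

```python
def encode_head(major, n):
    # CBOR head byte: major type in the top 3 bits, length/value below
    if n < 24:
        return bytes([(major << 5) | n])
    if n < 256:
        return bytes([(major << 5) | 24, n])  # one-byte length follows
    raise NotImplementedError("longer lengths omitted for brevity")

def encode(value):
    if isinstance(value, int) and value >= 0:
        return encode_head(0, value)                      # major type 0: uint
    if isinstance(value, bytes):
        return encode_head(2, len(value)) + value         # major type 2: bytes
    if isinstance(value, str):
        data = value.encode("utf-8")
        return encode_head(3, len(data)) + data           # major type 3: text
    if isinstance(value, dict):
        out = encode_head(5, len(value))                  # major type 5: map
        for k, v in value.items():
            out += encode(k) + encode(v)
        return out
    raise TypeError(type(value))

print(encode({"a": 1}).hex())  # a1616101
```

Four bytes for `{"a": 1}`, versus eight characters of JSON — and the byte string / text string distinction comes for free from the major type.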

dang · 1d ago
Related. Others?

Begrudgingly Choosing CBOR over MessagePack - https://news.ycombinator.com/item?id=43229259 - March 2025 (78 comments)

MessagePack vs. CBOR (RFC7049) - https://news.ycombinator.com/item?id=23838565 - July 2020 (2 comments)

CBOR – Concise Binary Object Representation - https://news.ycombinator.com/item?id=20603378 - Aug 2019 (71 comments)

CBOR – Concise Binary Object Representation - https://news.ycombinator.com/item?id=10995726 - Jan 2016 (36 comments)

Libcbor – CBOR implementation for C and others - https://news.ycombinator.com/item?id=9597198 - May 2015 (5 comments)

CBOR – A new object encoding format - https://news.ycombinator.com/item?id=6932089 - Dec 2013 (9 comments)

RFC 7049 - Concise Binary Object Representation (CBOR) - https://news.ycombinator.com/item?id=6632576 - Oct 2013 (52 comments)

zzo38computer · 6h ago
I prefer DER, which is also a binary format so it has the advantages of binary formats, too. (There is also BER, but in my opinion, DER is better.) I use DER in some programs, if the structured data format is useful. (Also, since text format is sometimes useful too, I had made up TER which is intended to be converted to DER. The DER file can be made in other ways as well and it is not required to use TER.)

(Also, standard ASN.1 does not have a key/value list type (which JSON and CBOR do have), but I had made up some nonstandard extensions to ASN.1 (called ASN.1X), including a few additional types, one of which is the key/value list type. Due to this, ASN.1X can now make a superset of the data that can be made by JSON (the only new type that is needed for this is the key/value list type; the other types of JSON are already standard ASN.1 types).)

brookst · 1d ago
Odd that the XML and JSON sections show examples of the format, but CBOR doesn’t. I’m left with no idea what it looks like, other than “building on JSON’s key/value format”.
cbm-vic-20 · 1d ago
There's an example in the "Putting it Together" section, showing JSON, a "human readable" representation of CBOR, and the hexadecimal bytes of CBOR.

https://cborbook.com/part_1/practical_introduction_to_cbor.h...

account-5 · 1d ago
I'm assuming, since it's binary encoded, the textual output would not be something you'd like to look at.
brookst · 1d ago
Why? I’m comfortable reading 0x48 0x65 0x78 0x61 0x64 0x65 0x63 0x69 0x6D 0x61 0x6C
8n4vidtmkvmk · 1d ago
With a table explaining what the byte codes mean? Absolutely I want to see that.
sam_lowry_ · 1d ago
People look at TCP packets all the time.
account-5 · 1d ago
In which format? As a list of 1s and 0s, or in hex? For TCP or IP, if I just pasted the textual version of any binary data I'd captured without some form of conversion, it's not good to look at. Especially if it's not accompanied by the encoding schema so you can actually make sense of it.
brookst · 1d ago
The encoding schemas are also present for XML and JSON, but not CBOR, so yes, that's another gap if the book is intended for a technical audience.
makapuf · 1d ago
ASN.1, while complex, really seems to be a step up from those (even if older) in terms of terseness (as a binary encoding) and generality.
eadmund · 1d ago
Would you rather write a parser for this:

    SEQUENCE {
      SEQUENCE {
        OBJECT IDENTIFIER '1 2 840 113549 1 1 1'
        NULL
        }
      BIT STRING 0 unused bits, encapsulates {
          SEQUENCE {
            INTEGER
              00 EB 11 E7 B4 46 2E 09 BB 3F 90 7E 25 98 BA 2F
              C4 F5 41 92 5D AB BF D8 FF 0B 8E 74 C3 F1 5E 14
              9E 7F B6 14 06 55 18 4D E4 2F 6D DB CD EA 14 2D
              8B F8 3D E9 5E 07 78 1F 98 98 83 24 E2 94 DC DB
              39 2F 82 89 01 45 07 8C 5C 03 79 BB 74 34 FF AC
              04 AD 15 29 E4 C0 4C BD 98 AF F4 B7 6D 3F F1 87
              2F B5 C6 D8 F8 46 47 55 ED F5 71 4E 7E 7A 2D BE
              2E 75 49 F0 BB 12 B8 57 96 F9 3D D3 8A 8F FF 97
              73
            INTEGER 65537
            }
          }
      }
or this:

    (public-key
      (rsa
        (e 65537)
        (n
         165071726774300746220448927123206364028774814791758998398858897954156302007761692873754545479643969345816518330759318956949640997453881810518810470402537189804357876129675511237354284731082047260695951082386841026898616038200651610616199959087780217655249147161066729973643243611871694748249209548180369151859)))
I know that I’d prefer the latter. Yes, we could debate whether the big integer should be a Base64-encoded binary integer or not, but regardless writing a parser for the former is significantly more work.

And let’s not even get started with DER/BER/PEM and all that insanity. Just give me text!

zzo38computer · 6h ago
That is a text format, although DER is a binary format that encodes the data which is there represented by text. I think they should not have used a bit string (or octet string) to encapsulate other ASN.1 data; it would be better to put it directly, but nevertheless it can work. The actual data to be parsed will be binary, not the text format like that.

DER is a more restricted variant of BER, and I think DER is better than BER. PEM is also the DER format, but encoded as base64 and with a header to indicate what type of data is being stored, rather than directly.

flowerthoughts · 21h ago
The ASN.1 notation wasn't meant for parsing. And then people started writing parser generators for it, so they adapted. However, you're abusing a text format meant for human reading and pretending it's a serialization format.

BER/PER are binary formats and great where binary formats are needed. You also have XER (XML) and JER (JSON) if you want text. You can create an s-expr encoding if you want.

Separate ASN.1 the data model from ASN.1 the abstract syntax notation (what you wrote) and from ASN.1's encoding formats.

[1] https://www.itu.int/en/ITU-T/asn1/Pages/asn1_project.aspx

eadmund · 18h ago
> However, you're abusing a text format for human reading and pretending it's a serialization format.

They should be the same, in order to facilitate human debugging. And we were discussing ASN.1, not its serialisations. Frankly, I thought that it was fairer to compare the S-expression to ASN.1, because both are human-readable, rather than to an opaque blob like:

    MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDrEee0Ri4Juz+QfiWYui/9UGSXau/2P8LjnTD8V4Unn+2FAZVGE3kL23bzeoULYv4PeleB3gfm
Sure, that blob is far more space-efficient, but it’s also completely opaque without tooling. Think how many XPKI errors over the years have been due to folks being unable to know at a glance what certificates and keys actually say.
jabl · 1d ago
Yes, but that comes from the telecom world. Hence thanks to NIH, that wheel must be reinvented.
nly · 1d ago
The FOSS tooling for it sucks balls. That's why
zzo38computer · 6h ago
Then, work to make a better one. (I had written a C library to read/write DER format, although it does not deal with the schema.)
JoelJacobson · 1d ago
Fun fact: CBOR is used within the WebAuthn (Passkey) protocol.

To do Passkey-verification server-side, I had to implement a pure-SQL/PLpgSQL CBOR parser, out of fear that a C-implementation could crash the PostgreSQL server: https://github.com/truthly/pg-cbor

esbranson · 7h ago
And .Net 5 circa 2020 added support for CBOR. ASP.NET ended up being a good choice for an experimental WebAuthn server for FedCM and DID experiments.
teatro · 1d ago
That’s why I’m wondering if there is an actual CBOR encoder in the browsers? I mean, there must be one, or am I wrong?
nabla9 · 1d ago
CBOR is for when you need the option of very small code size. If you can always use compression, CBOR provides no significant data-size improvement over JSON.

With small code size it also beats BSON, EBML, and others.

surajrmal · 1d ago
Or compute. Compression isn't free, especially on power constrained devices. At scale power and compute also have real cost implications. Most data centers have long been using binary encoding formats such as protobuf to save on compute and network bandwidth. cbor is nice because it's self describing so you can still understand it without a schema, which is a nice property people like about json.
8n4vidtmkvmk · 1d ago
Doesn't capn proto win hands down on compute?

I haven't used it, but I thought that was the big claim.

kentonv · 1d ago
Not necessarily.

Cap'n Proto serialization can be a huge win in terms of compute if you are communicating using shared memory or reading huge mmaped files, especially if the reader only cares to read some random subset of the message but not the whole thing.

But in the common use case of sending messages over a network, Cap'n Proto probably isn't a huge difference. Pushing the message through a socket is still O(n), and the benefits of compression might outweigh the CPU cost. (Though at least with Cap'n Proto, you have the option to skip compression. Most formats have some amount of compression baked into the serialization itself.)

Note that benchmarks vary wildly depending on the use case and the type of data being sent, so it's not really possible to say "Well it's N% faster"... it really depends. Sometimes Protobuf wins! You have to test your use case. But most people don't have time to build their code both ways to compare.

I actually think Cap'n Proto's biggest wins are in the RPC system, not the serialization. But these wins are much harder to explain, because it's not about speed, but instead expressiveness. It's really hard to understand the benefits of using a more expressive language until you've really tried it.

(I'm the author of Cap'n Proto.)

Zardoz84 · 1d ago
gzip, deflate, brotli ?
aidenn0 · 1d ago
I admit I got nerd-sniped here, but the table for floats[1] suggests that 10000.0 be represented as a float32. However, isn't it exactly representable as 0x70e2 in float16[2]? There are only 10 significant bits to the mantissa (including the implicit 1), while float16 has 11 so there's even an extra bit to spare.

1: https://cborbook.com/part_1/practical_introduction_to_cbor.h...

2: i.e. 1.220703125×2¹³

aidenn0 · 12h ago
Looks like it's a typo; they state:

> 0x47c35000 encodes 10000.0

But by my math that encodes 100000.0 (note the extra zero).
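Python's struct module (format code "e" is IEEE 754 binary16) confirms both claims:

```python
import struct

# 0x47c35000 decoded as a big-endian float32 is 100000.0, not 10000.0:
print(struct.unpack(">f", bytes.fromhex("47c35000"))[0])  # 100000.0

# and 10000.0 is exactly representable as a float16, namely 0x70e2:
print(struct.pack(">e", 10000.0).hex())                   # 70e2
print(struct.unpack(">e", bytes.fromhex("70e2"))[0])      # 10000.0
```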

fjfaase · 1d ago
This is a link to just one section of a larger book. The next section compares CBOR with a number of other binary storage formats, such as protobuf.
gethly · 1d ago
I wish browsers would support CBOR natively so I could just return CBOR instead of JSON(++speed --size ==win) and not have to be concerned with decoding it or not being able to debug requests in dev console.
dylan604 · 1d ago
JSON + compression (++speed --size ==win)

your server can do this natively for live data. your browser can decompress natively. and ++human-readable. if you're one of those that doesn't want the user to read the data, then maybe CBOR is attractive??? but why would you send data down the wire that you don't want the user to see? isn't the point of sending the data to the client so the client can display that data?

gethly · 1d ago
That is true. Basic content encoding works very well with JSON, but that still means there is the compression step, which would not be necessary with CBOR as it is already a binary payload. It would allow faster response and delivery times natively. Of course, we are talking a few ms, but I say why leave those ms on the floor?

I guess i'm just shouting at the clouds :D

8n4vidtmkvmk · 1d ago
It's still not attractive to hide data from the user. Unless it's encrypted, the user can read it.
dylan604 · 1d ago
i think i'm using a different meaning of "seeing". to the user, it won't be plain text that is human readable. unencrypted CBOR byte data might as well be encrypted to the end user.
glenjamin · 1d ago
The only mention I can see in this document of compression is

> Significantly smaller than JSON without complex compression

Although compression of JSON could be considered complex, it's also extremely simple in that it's widely used and usually performed in a distinct step - often transparently to a user. Gzip, and increasingly zstd are widely used.

I'd be interested to see a comparison between compressed JSON and CBOR, I'm quite surprised that this hasn't been included.

dylan604 · 1d ago
> I'm quite surprised that this hasn't been included.

Why? That goes against the narrative of promoting one over the other. Nissan doesn't advertise that a Toyota has something they don't. They just pretend it doesn't exist.

JimDabell · 1d ago
Previously:

CBOR – Concise Binary Object Representation - https://news.ycombinator.com/item?id=20603378 - Aug 2019 (71 comments)

Begrudgingly Choosing CBOR over MessagePack - https://news.ycombinator.com/item?id=43229259 - Mar 2025 (78 comments)

johnisgood · 1d ago
Erlang / Elixir has amazing support for ASN.1! I love it.

https://www.erlang.org/doc/apps/asn1/asn1_getting_started.ht...

https://www2.erlang.org/documentation/doc-14/lib/asn1-5.1/do... (https://www2.erlang.org/documentation/doc-14/lib/asn1-5.1/do...)

I am using ASN.1 to communicate between a client (Java / Kotlin) and server (Erlang / Elixir), but unfortunately Java / Kotlin has somewhat shitty support for ASN.1 in comparison to Erlang.

zzo38computer · 6h ago
I also use ASN.1 but I use C, so I wrote my own implementation of DER.
ghishadow · 1d ago
Erlang and ASN.1 are from telecom, so it makes sense that they have the best support
johnisgood · 16h ago
I agree, but that does not mean that other languages should have shitty support. It does not mean that it should not either, of course.
naggumsghost · 1d ago
If GML was an infant, SGML is the bright youngster who far exceeds expectations and made its parents too proud, but XML is the drug-addicted gang member who had committed his first murder before he had sex, which was rape.

https://www.schnada.de/grapt/eriknaggum-xmlrant.html

We're going to have to think up something worse for CBOR.

camgunz · 1d ago
Oh good, another CBOR thread. Disclaimer: I wrote and maintain a MessagePack implementation. I've also bird dogged this for a while, HN search me.

Mostly, I just want to offer a gentle critique of this book's comparison with MessagePack [0].

> Encoding Details: CBOR supports indefinite-length arrays and maps (beneficial for streaming when total size is unknown), while MessagePack typically requires fixed collection counts.

This refers to CBOR's indefinite length types, but awkwardly, streaming is a protocol level feature, not a data format level feature. As a result, there's many better options, ranging from "use HTTP" to "simply send more than 1 message". Crucially, CBOR provides no facility for re-syncing a stream in the event of an error, whether that's network or simply a bad encoding. "More features" is not necessarily better.
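For concreteness, the indefinite-length framing is just a start byte plus a "break" terminator (byte values from RFC 8949's major-type tables); here are the two encodings of the array [1, 2]:

```python
# Definite-length array [1, 2]: head 0x82 = major type 4, length 2
definite = bytes([0x82, 0x01, 0x02])

# Indefinite-length array [1, 2]: 0x9f opens, items follow, 0xff "break" closes.
# Note there is no marker a reader could use to re-sync if a byte is lost.
indefinite = bytes([0x9F, 0x01, 0x02, 0xFF])

print(definite.hex(), indefinite.hex())  # 820102 9f0102ff
```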

> Standardization: CBOR is a formal IETF standard (RFC 8949) developed through consensus, whereas MessagePack uses a community-maintained specification. Many view CBOR as a more rigorous standard inspired by MessagePack.

Well, CBOR is MessagePack. Carsten Bormann forked MessagePack, changed some of the tag values, wrote a standard around it, and submitted it to the IETF against the wishes of MessagePack's creators.

> Extensibility: CBOR employs a standardized semantic tag system with an IANA registry for extended types (dates, URIs, bignums). MessagePack uses a simpler but less structured ext type where applications define tag meanings.

Warning: I have a big rant about the tag registry.

The facilities are the same (well, the tag is 8 bytes instead of 1 byte, but w/e); it's TLV all the way down (Bormann ripped this also). Bormann's contribution is the registry, which is bonkers [1]. There's... dozens of extensions there? Hundreds? No CBOR implementation supports anywhere near all this stuff. "Universal Geographical Area Description (GAD) description of velocity"? "ur:request, Transaction Request identifier"?

The registry isn't useful. Here are the possible scenarios:

If something is in high demand and has good support across platforms, then it's a no-brainer to reserve a tag. MP does this with timestamps.

If something is in high demand, but doesn't have good support across platforms, then you're putting extra burden on those platforms. Ex: it's not great if my tiny microcontroller now has to support bignums or 128-bit UUIDs. Maybe you do that, or you make them optional, but that leads us to...

If something isn't in high demand or can't easily be supported across platforms, but you want support for it anyway, there's no need to tell anyone else you're using that thing. You can just use it. That's MP's ext types.

CBOR seems to imagine that there's a hypothetical general purpose decoder out there that you can point at any CBOR API, but there isn't and there never will be. Nothing will support both "Used to mark pointers in PSA Crypto API IPC implementation" and "PlatformV_HAS_PROPERTY" (I just cannot get over this stuff). There is no world where you tell the IETF about your tags, define an API with them, and someone completely independently builds a decoder for them. It will always be a person who cares about your specific tags, in which case, why not just agree on the ext types ahead of time? A COSE decoder doesn't also need to decode a "RAINS Message".

> Performance and Size: Comparisons vary by implementation and data. CBOR prioritizes small codec size (for constrained devices) alongside message compactness, while MessagePack focuses primarily on message size and speed.

I can't say I fully understand what this means, but CBOR and MP are equivalent here, because CBOR is MP.

> Conceptual Simplicity: MessagePack's shorter specification appears simpler, but CBOR's unification of types under its major type/additional info system and tag mechanism offers conceptual clarity.

Even if there's some subjectivity around "conceptual simplicity/clarity", again CBOR and MP are equivalent here because they're functionally the same format.
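To make the "functionally the same format" point concrete, here's a little sketch (mine, not from either spec's reference code) of the wire bytes for the string "hi" in each format:

```python
# Sketch: the wire bytes for the string "hi" in MessagePack and CBOR.
# Both formats use a single header byte that fuses type and length,
# followed by the raw UTF-8 payload -- only the constants differ.

def mp_fixstr(s: str) -> bytes:
    data = s.encode("utf-8")
    assert len(data) < 32          # fixstr covers lengths 0..31
    return bytes([0xA0 | len(data)]) + data

def cbor_text(s: str) -> bytes:
    data = s.encode("utf-8")
    assert len(data) < 24          # additional info 0..23 holds the length inline
    return bytes([(3 << 5) | len(data)]) + data   # major type 3 = text string

print(mp_fixstr("hi").hex())   # a26869
print(cbor_text("hi").hex())   # 626869
```

Same shape either way: one header byte carrying type plus length, then the payload.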

---

I have some notes about the blurb above too:

> MessagePack delivers greater efficiency than JSON

I think it's probably true that the fastest JSON encoders/decoders are faster than the fastest MP encoders/decoders. Not that JSON performance has a higher ceiling, but it's got gazillions of engineering hours poured into it, and rightly so. JSON is also usually compressed, so space benefits only matter at the perimeters. I'm not saying there's no case for MP/CBOR/etc., just that the efficiency/etc. gap is a lot smaller than one would predict.

> However, MessagePack sacrifices human-readability

This, of course, applies to CBOR as well.

> ext mechanism provides less structure than CBOR's IANA-registered tags

Again the mechanism is the same, only the registry is different.

[0]: https://cborbook.com/introduction/cbor_vs_the_other_guys.htm...

[1]: https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml

zzo38computer · 5h ago
> This refers to CBOR's indefinite length types, but awkwardly, streaming is a protocol level feature, not a data format level feature.

BER also has indefinite length as well as definite length, but the way it does it is not very good (DER only uses definite length). I think it is more helpful to use a different format when streaming with indefinite length is required, so I made up DSER (and SDSER), which works as follows:

- The type, which is encoded same as DER.

- If it is constructed, all items it contains come next (the length is omitted).

- If it is primitive, zero or more segments, each of which starts with one byte in range 0x01 to 0xFF telling how many bytes of data are in that segment. (The value is then just the concatenation of all segments together.)

- For both primitive and constructed, one byte with value 0x00 is the termination code.
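If I've read the description right, the primitive-value part could be sketched like this (the function names are mine, and I'm only doing the segment layer, not the DER type octets):

```python
# Sketch of the segment scheme described above: a primitive value is split
# into chunks of at most 255 bytes, each prefixed with a one-byte length in
# 0x01..0xFF, and the whole thing is terminated by a 0x00 byte.

def encode_segments(value: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(value), 255):
        chunk = value[i:i + 255]
        out.append(len(chunk))     # segment length, 0x01..0xFF
        out += chunk
    out.append(0x00)               # termination code
    return bytes(out)

def decode_segments(data: bytes) -> bytes:
    out, pos = bytearray(), 0
    while True:
        n = data[pos]; pos += 1
        if n == 0x00:              # terminator: value is the concatenation
            return bytes(out)
        out += data[pos:pos + n]; pos += n
```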

> Bormann's contribution is the registry, which is bonkers [1]. There's... dozens of extensions there? Hundreds? No CBOR implementation supports anywhere near all this stuff.

It should not need to support all of that stuff; you will only use the ones that are relevant for your program. (There is also the similar kind of complaint with ASN.1, and the similar response that I had made.)

> If something is in high demand, but doesn't have good support across platforms, then you're putting extra burden on those platforms. Ex: it's not great if my tiny microcontroller now has to support bignums or 128-bit UUIDs.

Although it is a valid concern, you would use data whose numbers are no bigger than you need, so it can avoid such a problem. You can treat UUIDs like octet strings, although if you only need small numbers then you should use the small number types instead, anyways.

> If something isn't in high demand or can't easily be supported across platforms, but you want support for it anyway, there's no need to tell anyone else you're using that thing.

Sometimes it is useful to tell someone else that you are using that thing, although often it is unnecessary, like you said.

aidenn0 · 1d ago
> ...awkwardly, streaming is a protocol level feature, not a data format level feature.

Indeed. I recall that tnetstrings were intentionally made non-streamable to discourage people from trying to do so: "If you need to send 1000 DVDs, don't try to encode them in 1 tnetstring payload, instead send them as a sequence of tnetstrings as payload chunks with checks and headers like most other protocols"

> Warning: I have a big rant about the tag registry.

> ...

I completely agree with your rant w.r.t. automated decoding. However, a global tag registry can still potentially be useful in that, given CBOR encoded data with a tag that my decoder doesn't support, it may be easier for a human to infer the intended meaning. Some types may be very obvious, others less so.

e.g. Standardized MIME types are useful even if no application supports every one of them.

camgunz · 19h ago
> However, a global tag registry can still potentially be useful in that, given CBOR encoded data with a tag that my decoder doesn't support, it may be easier for a human to infer the intended meaning.

Yeah if MP is conservative and CBOR is progressive, I'm slightly less conservative than MP: I'd support UUIDs and bignums. But again, they'd have to be very optional, like in the "we're only reserving these tags, not in any way mandating support" sense.

elcritch · 10h ago
> Well, CBOR is MessagePack. Carsten Bormann forked MessagePack

Sure, that’s sort of true but missing context. Bormann (and others) wanted to add things such as separate string and byte sequence types. The MessagePack creator refused for years. Fair enough it’s his format. But it frustrated the community dealing with string vs bytes issues. It also highlights a core philosophical difference of a mostly closed spec vs an extensible first one.

> changed some of the tag values, wrote a standard around it, and submitted it to the IETF against the wishes of MessagePack's creators.

That’s just incorrect and a childish way to view it in my opinion.

The core philosophy and mental models are different in key aspects.

MessagePack is designed as a small self mostly closed format. It uses a simple TLV format with a couple hundred possible user extensions and some clever optimizations. The MP “spec” focuses on this.

CBOR re-envisioned the core idea of MessagePack from the ground up as an extensible major/minor tag system. It’s debatable how much CBOR is a fork of MPack vs a new format with similarities.

The resulting binary output is pretty similar, with similar benefits, but the core theoretical models are pretty different. The IETF standard bears little to no resemblance to the MessagePack specification.

> The facilities are the same (well, the tag is 8 bytes instead of 1 byte, but w/e); it's TLV all the way down (Bormann ripped this also).

The whole point of CBOR is that the tags go from 1-8 bytes. The parser designs end up fairly different due to the different tag formats. I’ve written and ported parsers for both.

It’s not like the MessagePack creator invented TLV formats either. He just created an efficient and elegant one that’s pretty general. No one says he ripped off “TLV”.

You can’t just take a message pack parser and turn it into a CBOR one by changing some values. I’ve tried and it turns out poorly and doesn't support much of CBOR.

> This refers to CBOR's indefinite length types, but awkwardly, streaming is a protocol level feature, not a data format level feature.

The indefinite length format is very useful for embedded space. I’ve hit limits with MessagePack before on embedded projects because you need to know the length of an array upfront. I wished I’d had CBOR instead.

This can also be useful for data processing applications. For example streaming the conversion of a large XML file into a more concise CBOR format would be much more memory efficient. For large scale that’s pretty handy.
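For the curious, the encoder side of this is tiny; a minimal sketch (keeping to small unsigned ints so everything fits in one byte):

```python
# Sketch: CBOR's indefinite-length array lets an encoder emit items before
# the total count is known. Header 0x9F opens the array, 0xFF ("break")
# closes it -- no length field anywhere.

def encode_stream(items) -> bytes:
    out = bytearray([0x9F])            # major type 4, additional info 31
    for n in items:
        assert 0 <= n <= 23            # keep the sketch to one-byte unsigned ints
        out.append(n)                  # major type 0, value encoded in the header
    out.append(0xFF)                   # "break" stop code
    return bytes(out)

print(encode_stream(iter(range(3))).hex())  # 9f000102ff
```

Note that `items` can be any iterator: nothing has to be buffered to count it first, which is exactly the embedded/streaming win being described.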

> > However, MessagePack sacrifices human-readability

> This, of course, applies to CBOR as well.

For the binary format yes. However the CBOR specification defines an official human readable text format for debugging and documentation purposes. It also defines a schema system like json-schema but for CBOR.

Turns out “just some specs” can actually be pretty valuable.

camgunz · 6h ago
I am really glad you replied.

> Sure, that’s sort of true but missing context. Bormann (and others) wanted to add things such as separate string and byte sequence types. The MessagePack creator refused for years. Fair enough it’s his format. But it frustrated the community dealing with string vs bytes issues.

msgpack-ruby added string support less than a month after cbor-ruby's first commit [0] [1]. The spec was updated over two months before [2]. Awful lot of work if this were really just about strings.

> It also highlights a core philosophical difference of a mostly closed spec vs an extensible first one.

MP has been always been extensible, via ext types.

> That’s just incorrect

I am entirely correct [3].

> MessagePack is designed as a small self mostly closed format.

Isn't it a lot of effort to get an IETF standard changed? Isn't that the benefit of a standard? You keep saying "mostly closed" like it's bad. Data format standards in particular really shouldn't change: who knows how many zettagottabytes there are stored in previous versions?

> It’s debatable how much CBOR is a fork of MPack vs a new format with similarities.

cbor-ruby is literally a fork of msgpack-ruby. The initial commit [0] contains headers like:

    /*
     * CBOR for Ruby
     *
     * Copyright (C) 2013 Carsten Bormann
     *
     *    Licensed under the Apache License, Version 2.0 (the "License").
     *
     * Based on:
     ******/
    /*
     * MessagePack for Ruby
     *
     * Copyright (C) 2008-2013 Sadayuki Furuhashi

> The resulting binary output is pretty similar with similar benefits

This is the whole game isn't it? The binary output is pretty similar? These are binary output formats!

> but the core theoretical models are pretty different.

I think you're giving a little too much credence to the "theoretical model". It's not more elegant to do what cbor-ruby does [4] vs. what MP does [5] (this is my lib). I literally just use the tag value, or for fixed values I OR them together. The format is designed for you to do this. What's more elegant than a simple, predefined value?

> The whole point of CBOR is that the tags go from 1-8 bytes.

The tags themselves are only 1 byte, until you get to extension types.
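The "OR them together" bit above really is the whole trick; a toy illustration (mine, not from either codebase):

```python
# A CBOR header byte is just (major_type << 5) | additional_info, exactly as
# an MP fixstr header is 0xA0 | length: one shift and one OR, no table-driven
# machinery needed.

def cbor_header(major: int, info: int) -> int:
    assert 0 <= major <= 7 and 0 <= info <= 31
    return (major << 5) | info

assert cbor_header(0, 10) == 0x0A   # unsigned int 10
assert cbor_header(3, 2)  == 0x62   # text string of length 2
assert cbor_header(6, 1)  == 0xC1   # tag 1 (epoch timestamp)
```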

> The parser designs end up fairly different due to the different tag formats.

The creator of CBOR disagrees: cbor-ruby was a fork of msgpack-ruby with the tag values changed.

> No one says he ripped off “TLV”.

Don't conflate the general approach with literally forking an existing project.

> You can’t just take a message pack parser and turn it into a CBOR one by changing some values.

This is a strawman. My claim has been about the origins of CBOR, not how one can transmute an MP codec to a CBOR codec.

> I’ve hit limits with MessagePack before on embedded projects because you need to know the length of an array upfront.

When everything's fine, sure this works. If there are any problems whatsoever, you're totally screwed. Any protocol that supports streaming handles this kind of thing. CBOR doesn't. That's bad!

> For example streaming the conversion of a large XML file into a more concise CBOR format would be much more memory efficient.

It's probably faster to feed it through zstd. Also I think you underestimate how involved it'd be to round-trip a rich XML document to/from CBOR/MP.

> However the CBOR specification defines an official human readable text format for debugging and documentation purposes.

Where? Are you talking about Diagnostic Notation [6]? Hmm:

"Note that this truly is a diagnostic format; it is not meant to be parsed. Therefore, no formal definition (as in ABNF) is given in this document. (Implementers looking for a text-based format for representing CBOR data items in configuration files may also want to consider YAML [YAML].)"

YAML!? Anyway, it literally doesn't define it.

[0]: https://github.com/msgpack/msgpack-ruby/commit/60e846aaaa638...

[1]: https://github.com/cabo/cbor-ruby/commit/5aebd764c3a92d40592...

[2]: https://github.com/msgpack/msgpack/commit/5dde8c4fd0010e1435...

[3]: https://github.com/msgpack/msgpack/issues/129#issuecomment-1...

[4]: https://github.com/cabo/cbor-ruby/blob/5aebd764c3a92d4059236...

[5]: https://github.com/camgunz/cmp/blob/master/cmp.c#L30

[6]: https://www.rfc-editor.org/rfc/rfc8949.html#name-diagnostic-...

lofaszvanitt · 15h ago
Yeah, CBOR is very good, but you will only understand after you digest 100000 words about it :D.
darthrupert · 1d ago
CBOR has always seemed to me like the most promising data format for efficient data transfer. Somewhat weird how little use it has.
otterley · 1d ago
AWS is beginning to support it, starting with certain data-heavy APIs: https://aws.amazon.com/about-aws/whats-new/2025/07/amazon-cl...
naikrovek · 1d ago
people are just straight up afraid to write their own binary formats, aren't they.

it's not hard, it's exactly like creating your own text format but you write binary data instead of text, and you can't read it with your eyes right away (but you can after you've looked at enough of it.) there is nothing to fear or to even worry about; just try it. look up how things like TLV work on wikipedia. you can do just about anything you would ever need with plain binary TLV and it's gonna perform like you wouldn't believe.

https://en.wikipedia.org/wiki/Type%E2%80%93length%E2%80%93va...

binary formats are always going to be 1-2 orders of magnitude faster than plain text formats, no matter which plain text format you're using. writing a viewer so you can easily read the data isn't zero-effort like it is for JSON or XML where any existing text editor will do, but it's not exactly hard, either. your binary format reading code is the core of what that viewer would be.
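a hand-rolled TLV really is only a few lines. a minimal sketch (one type byte, a 4-byte big-endian length, then the value; the widths are my choice, not anything standard):

```python
# Minimal hand-rolled TLV: each record is a type byte, a 4-byte big-endian
# length, then the value bytes. That's the entire format.
import struct

def tlv_encode(records):                  # records: iterable of (type, bytes)
    out = bytearray()
    for t, v in records:
        out += struct.pack(">BI", t, len(v)) + v
    return bytes(out)

def tlv_decode(data):
    pos = 0
    while pos < len(data):
        t, n = struct.unpack_from(">BI", data, pos)
        pos += 5                          # skip the type byte and length field
        yield t, data[pos:pos + n]
        pos += n

msgs = [(1, b"hello"), (2, b"\x00\x01\x02")]
assert list(tlv_decode(tlv_encode(msgs))) == msgs
```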

once you write and use your own binary format, existing binary formats you come across become a lot less opaque, and it starts to feel like you're developing a mild superpower.

markisus · 1d ago
CBOR has some stuff that is nice but would be annoying to reimplement. Like using more bytes to store large numbers than small ones. If you need a quick multipurpose binary format, CBOR is pretty good. The only alternative I’d make manually is just memcpy the bytes of a C struct directly to disk and hope that I won’t encounter a system with different endianness.
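The "more bytes for large numbers" part is smaller than it sounds to reimplement; here's a sketch following CBOR's unsigned-int rule (per RFC 8949; a hand-rolled format could of course pick other widths):

```python
# Variable-width unsigned int encoding, CBOR style: values 0..23 fit in the
# header byte itself; larger values get a 1/2/4/8-byte big-endian payload
# selected by additional info 24/25/26/27.
import struct

def cbor_uint(n: int) -> bytes:
    if n < 24:
        return bytes([n])                 # value lives in the header byte
    for info, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
        try:
            return bytes([info]) + struct.pack(fmt, n)
        except struct.error:
            continue                      # too big for this width, try the next
    raise OverflowError("needs a bignum tag")

print(cbor_uint(10).hex())     # 0a
print(cbor_uint(500).hex())    # 1901f4
```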
neutrinobro · 1d ago
These days you don't have to worry about endianness much (unless you're dealing with raw network packets). However, you do need to worry about byte padding. Different compilers/systems will place padding between items in your struct differently (depending on the contents and ordering of items), and if you are not careful the in-memory or on-disk placement of struct data elements can be misaligned on different systems. Most systems align to an 8-byte boundary, but that isn't guaranteed.
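You can see the padding effect even from Python's struct module, which mirrors C layout rules (exact native sizes vary by platform, which is the whole point):

```python
# "@" uses native alignment (pad bytes inserted before the int),
# "=" uses standard sizes with no padding at all.
import struct

packed = struct.calcsize("=ci")   # char + int, no padding
native = struct.calcsize("@ci")   # char + int, native alignment

print(packed)   # 5 on every platform
print(native)   # typically 8 on x86-64: three pad bytes before the int
```

memcpy'ing the struct serializes those pad bytes too, so two builds that disagree on alignment silently disagree on the file format.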
markisus · 23h ago
Yeah I try to make sure I do the extern c. I’m also on x86 so I just pretend that alignment is not an issue and I think it works.
hvb2 · 1d ago
I assume you mean as an exercise? Not for actual use in any production system?

If you did mean for production use, I assume you also implement your own encryption, encoding schemes and everything else?

naikrovek · 1d ago
i write my own binary formats because they're fast and small. yes, in production. partly because it's just as easy as anything else for me now, partly because it doesn't require any dependencies at all, and partly to show others just how easy it is, because i think people are unnecessarily afraid of this.

no i don't write my own encoding or encryption.

why the hell would anyone use json for everything, and why would someone who doesn't do that earn your derision?

hvb2 · 1d ago
I didn't say anywhere that we should use json for everything.

I think most people would go with something standard and documented. If you work in a team it helps if you can hire people that are familiar with tech or can read up on it easily.

And in general, unless you can show that your formatter is an actual hot path in need of optimization, you've just added another piece of code in need of care and feeding for no real gain.

Most devs/applications are fine with protobuf or even Json performance. And solving that problem is not something they can or should do.

If you write something like that just to prove a point, good for you. Also I would never want to be on the same team

kookamamie · 1d ago
The article reads like a semi-slop with its numerous lists and overly long explanations of obvious things, such as how XML came to be.
zbendefy · 1d ago
How different is CBOR compared to BSON? Both seem to be binary json-like representations.

Edit: BSON seems to contain more data types than JSON, and as such it is more complex, whereas CBOR doesn't add to JSON's existing structure.

EdSchouten · 1d ago
That's not entirely true: with CBOR you can add custom data types through custom tags. A central registry of them is here:

https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml

This is, for example, used by IPLD (https://ipld.io) to express references between objects through native types (https://github.com/ipld/cid-cbor/).

maxbond · 1d ago
I think parsing BSON is simpler than parsing JSON, BSON has additional types but the top level is always a document. Whereas the following are all valid JSON:

- `null`

- `"hello"`

- `[1,2,NaN]`

Additionally, BSON will just tell you what the type of a field is. JSON requires inferring it.

zokier · 1d ago
NaN is not part of JSON by any spec. Top level scalar values were disallowed by RFC 4627.
maxbond · 1d ago
Fair enough. I'm not sure how much JSON parsers in the wild care about that spec. I just tried with Python and it was happy to accept scalars and NaN. JavaScript rejected NaN but was happy to accept a scalar. But sure, compliant parsers can disregard those cases.
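For the record, here's what Python's stock json module actually does with those cases (`parse_constant` is the documented hook for rejecting the non-spec constants):

```python
# Python's json module accepts top-level scalars and, by default, the
# non-spec constants NaN/Infinity; rejecting them is opt-in.
import json, math

assert json.loads("null") is None           # top-level scalar: accepted
assert json.loads('"hello"') == "hello"
assert math.isnan(json.loads("NaN"))        # non-spec constant: accepted by default

def reject(name):                           # strict mode requires asking for it
    raise ValueError(f"non-JSON constant: {name}")

try:
    json.loads("NaN", parse_constant=reject)
except ValueError:
    print("rejected")                       # prints "rejected"
```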