Is something wrong with binary formats?

October 20, 2024

For quite some time I have been wondering why text formats are so popular while binary formats often struggle. Even for graphics there is PPM, and JSON and HTTP¹ are now ubiquitous. While there are very popular binary formats like ELF and JPEG, they are also very specialized; less specialized formats like MsgPack and ProtoBuf are far less common.

It has been argued that this is rooted in human capabilities: our mental recognition works well with text, including text-based formats, yet is totally unsuitable for binary data. While I do agree with that, the core problem is subtler in my opinion.

One part is the human-machine interface (i.e. tooling). Imagine having half of your text files encoded in EBCDIC and some others in word-swapped UTF-32. Now, suddenly, your mental recognition is of no use, as your text editor just can’t open them and leaves you looking at c1 e2 c3 c9 c9 in the hex editor...
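To make the scenario concrete, here is a minimal Python sketch. It assumes `cp037` (one of the EBCDIC code pages shipped with Python) stands in for "EBCDIC", and it takes "word-swapped UTF-32" to mean little-endian UTF-32 with the two 16-bit halves of each unit exchanged; that reading, the sample string, and the helper `swap_words` are assumptions made for illustration only.

```python
def swap_words(data: bytes) -> bytes:
    """Exchange the two 16-bit halves of every 32-bit unit (assumed layout)."""
    out = bytearray()
    for i in range(0, len(data), 4):
        unit = data[i:i + 4]
        out += unit[2:4] + unit[0:2]
    return bytes(out)


SAMPLE = "HELLO"  # arbitrary sample text

ebcdic = SAMPLE.encode("cp037")                          # EBCDIC bytes
swapped_utf32 = swap_words(SAMPLE.encode("utf-32-le"))   # word-swapped UTF-32 bytes

# What a tool that only understands ASCII makes of each file:
# undecodable bytes become U+FFFD, NUL bytes pass through as control characters.
for label, raw in (("EBCDIC", ebcdic), ("word-swapped UTF-32", swapped_utf32)):
    print(label)
    print("  hex view  :", raw.hex(" "))
    print("  ASCII view:", repr(raw.decode("ascii", errors="replace")))

# With the right decoder the text is still there.
print(ebcdic.decode("cp037"))
print(swap_words(swapped_utf32).decode("utf-32-le"))
```

Nothing about the content changed in either case; only the tool's assumption about the byte layout stopped matching the data.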

That’s not enough for a problem though; after all, if we have good tooling for text, why can’t we have the same for binary? The problem here is one detail that is so obvious it is often omitted:

Text is binary.

Text is not an alternative to binary; it’s a binary format of its own. Or rather, a text encoding is. There are many encodings, but what they encode is the very same notion of text, just like there are many image formats but what they encode is the same notion of pixels².
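A small Python sketch of that point: one string, several encodings, several different byte sequences, and the same text coming back out of each. The sample word and the particular list of codecs are arbitrary choices for illustration.

```python
TEXT = "café"  # arbitrary sample with one non-ASCII character

# A handful of codecs from Python's standard library.
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-be", "latin-1", "cp037"]

# Same text, one binary representation per encoding.
encoded = {name: TEXT.encode(name) for name in ENCODINGS}

# Decoding each representation with its own codec recovers the same text.
decoded = {name: raw.decode(name) for name, raw in encoded.items()}

for name, raw in encoded.items():
    print(f"{name:10} {len(raw):2} bytes  {raw.hex(' ')}")
```

Every row of the output is a different sequence of bytes, yet each one decodes to the identical string, much as a PNG and a BMP of the same picture decode to the same pixels.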

Now, that may or may not be obvious but it doesn’t look like a problem. But the more interesting part is a corollary:

Binary formats (like MsgPack) aren’t siblings of text formats (like JSON); they are siblings of the text format itself, of the venerable ASCII!
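The layering can be sketched in Python. JSON produces text, and a separate text encoding then turns that text into bytes; a MsgPack-style encoder goes from the value to bytes in one step. The helper `pack_tiny` below is hypothetical and hand-rolled for this illustration: it covers only three MessagePack types (fixmap, fixstr, positive fixint) and is not a real MessagePack library.

```python
import json


def pack_tiny(obj: dict) -> bytes:
    """Encode a small {str: int} dict with a tiny subset of MessagePack."""
    if len(obj) > 15:
        raise ValueError("fixmap holds at most 15 entries")
    out = bytearray([0x80 | len(obj)])       # fixmap header: 1000xxxx
    for key, value in obj.items():
        raw = key.encode("utf-8")
        if len(raw) > 31:
            raise ValueError("fixstr holds at most 31 bytes")
        if not 0 <= value <= 127:
            raise ValueError("positive fixint covers 0..127 only")
        out.append(0xA0 | len(raw))          # fixstr header: 101xxxxx
        out += raw
        out.append(value)                    # positive fixint: 0xxxxxxx
    return bytes(out)


value = {"a": 1}

# JSON: value -> text -> bytes. The second step is an ordinary text encoding.
as_text = json.dumps(value)
json_utf8 = as_text.encode("utf-8")
json_utf16 = as_text.encode("utf-16-le")     # same JSON document, different bytes

# MsgPack-style: value -> bytes, with no text layer in between.
packed = pack_tiny(value)

print(repr(as_text))
print(json_utf8.hex(" "))
print(json_utf16.hex(" "))
print(packed.hex(" "))
```

Swapping UTF-8 for UTF-16 changes the JSON bytes without changing the JSON document, which shows that the byte-level format in play is the text encoding. The packed form has no such intermediate layer, so it sits at the level where the text encoding sits.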

This makes it clear. Competing with XML and JSON is one thing, but competing with ASCII is another level entirely. No wonder there is no accepted universal binary format: there actually is one, and it’s called “text”.

I think this is why EBCDIC has died. It is also probably a major reason Unicode succeeded despite all the pain: while for human-targeted data it may be beneficial to have tailored formats for different needs (scripts), for machine communication the benefits of a single universal format are so enormous that UTF-8 was accepted nevertheless.

¹ Okay, HTTP/2 is not text-based anymore. Except it is, in a sense: it still uses textual key-value pairs, only the wire-level encoding is a bit more complicated.

² For raster image formats only, obviously.