I prefer human-readable file formats

62 Bogdanp 49 8/9/2025, 9:13:48 AM adele.pollux.casa ↗

Comments (49)

mxmlnkn · 1h ago

I concur with most of these arguments, especially about longevity. But, this only applies to smallish files like configurations because I don't agree with the last paragraph regarding its efficiency.

I have had to work with large 1GB+ JSON files, and it is not fun. Amazing projects such as jsoncons for streaming JSONs, and simdjson, for parsing JSON with SIMD, exist, but as far as I know, the latter still does not support streaming and even has an open issue for files larger than 4 GiB. So you cannot have streaming for memory efficiency and SIMD-parsing for computational efficiency at the same time. You want streaming because holding the whole JSON in memory is wasteful and sometimes not even possible. JSONL tries to change the format to fix that, but now you have another format that you need to support.
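A minimal sketch of the kind of streaming JSONL makes possible, assuming one JSON document per line. The helper name `iter_jsonl` and the sample records are hypothetical, and only the Python standard library is used:

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[Any]:
    """Yield one parsed record per non-empty line, keeping only one line in memory."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample data standing in for a large file on disk.
sample = io.StringIO('{"id": 1, "seq": "ACGT"}\n{"id": 2, "seq": "TTGA"}\n')
total_bases = sum(len(record["seq"]) for record in iter_jsonl(sample))
```

Memory use is bounded by the longest line rather than by the file size, which is the trade the comment describes: it works, but only because the format changed.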

I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data. Storing binary data as base64 strings seems wasteful. Random access into these files is also an issue, depending on the use case. Sometimes it would be a nice feature to jump over some data, but for JSON, you cannot do that without parsing everything in search of the closing bracket or quotes, accounting for escaped brackets and quotes, and nesting.
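To put a number on "wasteful": base64 maps every 3 input bytes to 4 ASCII characters, so the encoded form is about a third larger before any JSON quoting is added. A small sketch, with random bytes standing in for an image or compressed payload:

```python
import base64
import os

payload = os.urandom(3000)            # stand-in for image or compressed bytes
encoded = base64.b64encode(payload)   # what would be stored inside a JSON string

overhead = len(encoded) / len(payload)  # 4/3 when the length is a multiple of 3
```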

jerf · 7m ago

My rule of thumb that has been surprisingly robust over several uses of it is that if you gzip a JSON format you can expect it to shrink by a factor of about 15.

That is not the hallmark of a space-efficient file format.

Between repeated string keys and frequently repeated string values, that are often quite large due to being "human readable", it adds up fast.
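A rough way to check this on your own data, as a sketch only: the records below are made up, and the ratio depends heavily on how repetitive the keys and values are, so this does not confirm the factor of 15.

```python
import gzip
import json

# Hypothetical records with long, repeated, "human readable" keys and values.
records = [
    {
        "sample_identifier": f"sample-{i:05d}",
        "sequencing_platform": "illumina",
        "quality_passed": True,
    }
    for i in range(10_000)
]

raw = json.dumps(records).encode("utf-8")
packed = gzip.compress(raw)
ratio = len(raw) / len(packed)  # how many times smaller the gzipped form is
```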

"I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data."

One trick you can use is to prefix a file with some JSON or other readable value, then dump the binary afterwards. The JSON can have offsets into the binary as necessary for identifying things or labeling whether or not it is compressed or whatever. This often largely mitigates the inefficiency concerns because if you've got a big pile of binary data the JSON bloat by percent tends to be much smaller than the payload; if it isn't, then of course I don't recommend this.
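A sketch of that trick, under assumptions that are not in the comment: the 4-byte length prefix, the `blobs` key, and the function names are all hypothetical choices made here for illustration.

```python
import io
import json
import struct


def write_container(out, blobs: dict[str, bytes]) -> None:
    """Write a 4-byte big-endian header length, a JSON header, then the raw blobs."""
    index, offset = {}, 0
    for name, data in blobs.items():
        index[name] = {"offset": offset, "length": len(data)}
        offset += len(data)
    header = json.dumps({"blobs": index}).encode("utf-8")
    out.write(struct.pack(">I", len(header)))
    out.write(header)
    for data in blobs.values():
        out.write(data)


def read_blob(src, name: str) -> bytes:
    """Seek straight to one blob; offsets are relative to the end of the header."""
    src.seek(0)
    (header_len,) = struct.unpack(">I", src.read(4))
    entry = json.loads(src.read(header_len))["blobs"][name]
    src.seek(4 + header_len + entry["offset"])
    return src.read(entry["length"])


buf = io.BytesIO()
write_container(buf, {"thumb.png": b"\x89PNG\x00\x01", "notes.bin": b"\xff" * 10})
```

The header stays readable in a hex viewer or with `head`, while the payload is stored without base64 overhead.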

andreypopp · 1h ago

try clickhouse-local, it's amazing how it can crunch JSON/TSV or whatever at great speed

mriet · 1h ago

I can understand this for "small" data, say less than 10 MB.

In bioinformatics, basically all of the file formats are human-readable/text based. And file sizes range between 1-2 MB and 1 TB. I regularly encounter 300-600 GB files.

In this context, human-readable files are ridiculously inefficient, on every axis you can think of (space, parsing, searching, processing, etc.). It's a GD crime against efficiency.

And at that scale, "readable" has no value, since it would take you longer to read the file than 10 lifetimes.

graemep · 1h ago

I do not think the argument is that ALL data should be in human readable form, but I think there are far more cases of data being in a binary form when it would be better human readable. Your example of a case where it is human readable when it should be binary is rarer for most of us.

In some cases human readable data is for interchange and it should be processed and queried in other forms - e.g. CSV files to move data between databases.

An awful lot of data is small - and these days I think you can say small is quite a bit bigger than 10 MB.

Quite a lot of data that is extracted from a large system would be small at that point, and would benefit from being human readable.

The benefit of data being human readable is not necessarily that you will read it all, but that it is easier to read bits that matter when you are debugging.

attractivechaos · 27m ago

> human-readable files are ridiculously inefficient on every axis you can think of (space, parsing, searching, processing, etc.).

In bioinformatics, most large text files are gzip'd. Decompression is a few times slower than proper file parsing in C/C++/Rust. Some pure Python parsers can be "ridiculously inefficient" but that is not the fault of human-readability. Binary files are compressed with existing libraries. Compressed binary files are not noticeably faster to parse than compressed text files. Binary formats can indeed be smaller, but space-efficient formats take years to develop and tend to have more compatibility issues.

> And at that scale, "readable" has no value, since it would take you longer to read the file than 10 lifetimes.

You can't read the whole file by eye, but you can (and should often) eyeball small sections in a huge file. For that, you need a human-readable file format. A problem with this field IMHO is that not many people are literally looking at the data by eye.

bregma · 2h ago

I journeyed from fancy commercial bookkeeping systems that changed data formats every few years (with no useful migration) to GnuCash and finally to Plain-Text Accounting. I can finally get the information I need with easy backups (through VCS) and flexibility (through various tools that transform the data). The focus is on content, not tools or presentation or product.

When I write I write text. I can transform text using various tools to provide various presentations consumable through various products. The focus is on content, not presentation, tools, or product.

I prefer human-readable file formats, and that has only been reinforced over more than 4 decades as a computer professional.

gcarvalho · 38m ago

I have recently migrated ~8y of Apple Numbers spreadsheets (an annoyingly non-portable format) to plaintext accounting.

It took me many hours and a few backtracks to get to a point where I am satisfied with it, and where errors are caught early. I would suggest that anyone starting now enable --strict --pedantic on ledger-cli from day 1, and write asserts for their accounts as well, e.g. to check that closed accounts don’t get new entries.

I really miss data entry being easier and not as prone to free-form text editing errors (most common are typos on the amount or copying the wrong source/dest account), but I am confident it matches reality much better than my spreadsheets did.

kjellsbells · 44m ago

Ease of: reading, comprehension, manipulation, short- and long-term retrieval are not the same problems. All file formats are bad at at least one of these.

Given an arbitrary stream of bytes, readability only means the human can inspect the file. We say "text is readable" but that's really only because all our tooling for the last sixty years speaks ASCII and we're very US-centric. Pick up a text file from 1982 and it could be unreadable (EBCDIC, say). Time to break out dd and cross your fingers.

Comprehension breaks down very quickly beyond a few thousand words. No geneticist is loading up a gig of CTAGT... and keeping that in their head as they whiz up and down a genome. Humans have a working set size.

Short term retrieval is excellent for text and a PITA for everything else. Raise your hand if you've gotten a stream of bytes, thrown file(1) at it, then strings(1), and then resorted to od or picking through the bytes.

Long term retrieval sucks for everyone. Even textfiles. After all, a string of bytes has no intrinsic meaning except what the operating system and the application give it. So who knows if people in 2075 will recognise "48 65 6C 6C 6F 20 48 4E 21"?

wizzwizz4 · 33m ago

I decoded that as "Hello HI!" using basic cryptanalysis, the assumption that the alphabet would be mostly contiguous, the assumption that capital and lower-case are separated by a bit, and the knowledge that 0x20 is space and 0x21 is exclamation mark. On a larger text, we wouldn't even need these assumptions: cryptanalysis is sufficiently-powerful, and could even reverse-engineer EBCDIC! (Except, it might be difficult to figure out where the punctuation characters go, without some unambiguous reference such as C source code: commas and question marks are easy, but .![]{} are harder.)

Edit: I can't count. H and I are consecutive in the alphabet, and it actually says "Hello HN!". I think my general point is valid, though.
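For readers who would rather not do the cryptanalysis by hand, the bytes can be checked in a few lines. The EBCDIC step uses code page 037, which is only one of several EBCDIC variants and is chosen here as an assumption:

```python
hex_dump = "48 65 6C 6C 6F 20 48 4E 21"

raw = bytes.fromhex(hex_dump)      # whitespace between byte pairs is ignored
as_ascii = raw.decode("ascii")

# The same text in EBCDIC (code page 037) is a different byte sequence,
# which is why a 1982 text file may look like noise to ASCII tooling.
as_ebcdic = as_ascii.encode("cp037")
```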

graphviz · 29m ago

We learned the hard way that, for some of us, it's all too easy to make careless design errors that become baked-in and can't be fixed in a backward-compatible way (either at the DSL or API level). An example in Graphviz is its handling of backslash in string literals: it is used to escape special characters (like quotes \"), to map special characters (like several flavors of newline with optional justification \n \l \r), and to indicate variables (like node names in labels \N), along with magic code that knows that if the -default- node name is the empty string, that actually means \N, but if a particular node name is the empty string, then it stays.

There was a published study, Wrangling Messy CSV Files by Detecting Row and Type Patterns by Gerrit J. J. van den Burg, Alfredo Nazábal, and Charles Sutton (Data Mining and Knowledge Discovery, 2019) that showed many pitfalls with parsing CSV files found on GitHub. Their dialect-detection method achieved 97% accuracy. It's easy to write code that slings out some text fields separated by commas, with the objective of using a human-readable portable format.

You can learn even more by allowing autofuzz to test your nice simple code to parse human readable files.

whobre · 47m ago

Even "human-readable" formats are only readable if you have proper tools - i.e. editors or viewers.

If a binary file has a well-known format and tools available to view/edit it, I see zero problems with it.

Too · 1h ago

Let’s say that hypothetically one were to disagree with this. What would be the best alternative format? One that has ample tooling for editing and diffing, as though it were text, yet stores things more efficiently.

Most of the arguments presented in TFA are about openness, which can still be achieved with standard binary formats and a schema. Hence the problem left to solve is accessibility.

I’m thinking of something like Parquet, protobuf or SQLite. Despite their popularity, they still aren’t trivial for anyone to edit.

kyrra · 27m ago

Protobuf has a text and binary format. https://protobuf.dev/reference/protobuf/textformat-spec/

Google uses it a lot for data dumps for tests or config that can be put into source control.

aldonius · 36m ago

I suppose with SQLite files, you could at least in theory diff their SQL-dump representations, though you'd presumably want a way to canonicalise said representation. In a way I suppose each (VCS) commit is a bit like a database migration.
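A sketch of what diffing SQL-dump representations could look like with the Python standard library. The `accounts` table and the helper name `sql_dump` are hypothetical, and no canonicalisation is attempted beyond what `iterdump()` produces:

```python
import difflib
import sqlite3


def sql_dump(conn: sqlite3.Connection) -> list[str]:
    """Return the database as a list of SQL statements."""
    return list(conn.iterdump())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('cash', 100)")
conn.commit()
old_dump = sql_dump(conn)

conn.execute("UPDATE accounts SET balance = 80 WHERE name = 'cash'")
conn.commit()
new_dump = sql_dump(conn)

# A text diff of the two dumps, much like a VCS would show between commits.
diff = list(difflib.unified_diff(old_dump, new_dump, "old.sql", "new.sql", lineterm=""))
```

The diff shows the changed row as one removed and one added `INSERT` statement. For real databases the row order in the dump would need to be made stable first, which is the canonicalisation concern raised above.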

paulddraper · 37m ago

ZIP archive of XML is used for Office documents

JdeBP · 2h ago

Given that the author mentions CSV and text table formats, the article's list of the "entire Unix toolchain" is significantly impoverished not only by the lack of ex (which is usefully scriptable) but by the lack of mlr.

* https://miller.readthedocs.io/

vis/unvis are fairly important tools for those text tables, too.

Also, FediVerse discussion: https://social.pollux.casa/@adele/statuses/01K1VA9NQSST4KDZP...

hebocon · 2h ago

Wow, I've never heard of 'mlr' before. Looks like a synthesis of Unix tools, jq, and others? Very useful - hopefully it's packaged everywhere for easy access.

mschwaig · 53m ago

Human-readability was one of the aspects that I enjoyed about using CCL, the Categorical Configuration Language (https://chshersh.com/blog/2025-01-06-the-most-elegant-config...), in one of my projects recently.

It saves you from escaping stuff inside of multiline-strings by using meaningful whitespace.

What I did not like so much about CCL is that it leaves a bunch of stuff underspecified. You can make lists and comments with it, but YOU have to decide how.

refactor_master · 2h ago

Clearly there’s a very real need for binary data formats, or we wouldn’t have them. For one, it’s much more space efficient. Does the author know how much storage cost in 1985? Or how slow computers were?

If I time traveled back to 1985 and told corporate to adopt CSV because it’d be useful in 50 years when unearthing old customer records I’d be laughed out of the cigar lounge.

graemep · 1h ago

Except there are many things for which we used human readable formats in the 1980s for which we use binary formats now - HTTP headers, for example.

CSV was definitely in wide use back then.

Text formats are compressible.

self_awareness · 58m ago

Text formats are compressible because they waste a lot of space to encode data. Instead of the space of 256 values per byte they use maybe 100.

graemep · 49m ago

I assumed that is common knowledge here. The point is that you need to take that into account when discussing storage requirements.

burnt-resistor · 1h ago

I guess you've never used UNIX or understood the philosophy.

https://en.wikipedia.org/wiki/Unix_philosophy

There already exist a bazillion binary serialization formats: protobufs, thrift, msgpack, capnproto, etc. but these all suffer from human inaccessibility. Generally, they should be used only when performance becomes a severe limiting factor but never before or it's likely a sign of premature optimization.

tliltocatl · 49m ago

It's often too late to overhaul your systems when performance becomes a severe limiting factor. By that point things like data format are already set in stone. The whole "premature optimization" was originally about peep-hole stuff, not architecture-defining concerns, and it's really sad to see it misapplied to "let's store everything as json and use O(n²) everywhere and hopefully it will be someone else's problem".

adregan · 1h ago

Are there any binary formats that include the specification in the format itself?

xandrius · 56m ago

Don't most binary formats have some specification somewhere (either private or public)?

Unless someone just decided to shove random stuff in binary mode and call it a day?

huhtenberg · 58m ago

https://en.wikipedia.org/wiki/ASN.1

kamatour · 1h ago

Readable files are great… until they’re 1TB and you just want to cry.

qiine · 59m ago

1TB of perfectly readable, human despair.

LoganDark · 1h ago

To be fair, nothing's great when I want to cry.

IanCal · 2h ago

> Unlike binary formats or database dumps, these files don't hide their meaning behind layers of abstraction. They're built for clarity, for resilience, and for people who like to know what's going on under the hood.

CSV files hide their meaning in external documentation or someone’s head, are extremely unclear in many cases (is this a number or a string? A date?) and are extremely fragile when it comes to people editing them in text editors. They entirely lack checks and verification at the most basic level and, worse still, they’re often but not always perfectly line based. Many tools then work fine until they completely break your file and you won’t even know. Until I get the file and tell you, I guess.
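A small illustration of two of these pitfalls, using made-up data: a quoted field that contains a newline, and an identifier with leading zeros whose type the file does not record.

```python
import csv
import io

# One header row and one data row; the address field contains a quoted newline.
text = 'id,address,joined\n007,"1 Main St\nSpringfield",2025-08-09\n'

naive_rows = text.splitlines()                      # what line-based tools see
parsed_rows = list(csv.reader(io.StringIO(text)))   # what a CSV parser sees

record_id = parsed_rows[1][0]   # always a string; the file says nothing about types
as_number = int(record_id)      # treating it as a number drops the leading zeros
```

Line-based tools see three lines where a CSV parser sees two rows, so anything that sorts, filters, or truncates by line can split the record without raising an error.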

I’ve spent years fixing issues introduced by people editing them like they’re text.

If you’ve got to use tools to not completely bugger them then you might as well use a good format.

fireflash38 · 1h ago

If you're reading in data, you need to parse and verify it anyway.

IanCal · 1h ago

Which you might not be able to do after it’s been broken silently.

fireflash38 · 1h ago

That's still an issue with binary files too, and you can't even look at them to fix.

burnt-resistor · 1h ago

They're standardized[0], so it's only stupid humans screwing them up.

Maybe you need a database or an app rather than flat files.

0. https://www.ietf.org/rfc/rfc4180.txt

IanCal · 1h ago

That came long after CSV files started being used, and many parsers don’t follow the spec. Even if they do, editing the file manually can easily and silently break it; my criticisms apply to files that are entirely valid under the new spec. The wide range of ways people make CSVs is a whole other thing I’ve spent years fixing.

It’s not about the stupidity of the humans, and if it was then planning for “no stupid people” is even stupider than those messing up the files.

> Maybe you need a database or an app rather than flat files.

Flat files are great. What’s needed are good file formats.

burnt-resistor · 1h ago

TOML

What's the problem?

integralid · 55m ago

But TOML is not a good file format. Quite the opposite actually.

https://hitchdev.com/strictyaml/why-not/toml/

IanCal · 1h ago

What are you trying to ask? I don’t understand. I’m not talking about toml.

burnt-resistor · 1h ago

I gave you a good text file format. You're acting like there are no good file formats. Either invent a domain-specific one, use a standard one, or use a different modality rather than complain that a utopia you won't bother to create doesn't exist.

Someone · 1h ago

> They're standardized[0]

From that article:

“This memo […] does not specify an Internet standard of any kind”

and

“Interoperability considerations:

Due to lack of a single specification, there are considerable differences among implementations. Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing CSV files”

burnt-resistor · 1h ago

Are you AI? I was replying to a comment, not the article.

Also, you're quoting me to myself: https://news.ycombinator.com/item?id=44837879

codr7 · 1h ago

I'll take sexprs over CSV/JSON/YAML/XML any day.

self_awareness · 1h ago

I'm not sure the author knows much about binary formats.

Binary formats are binary for a reason. Speed of interpretation is one reason. Usage of memory is another reason. Directly mapping it and using it, is another reason. Binary formats can make assumptions about system memory page size. They can store internal offsets to make incremental reading faster. None of this is offered by text formats.
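One concrete form of this is fixed-width records, where the position of record N can be computed instead of searched for. A minimal sketch; the record layout is hypothetical:

```python
import struct

# Hypothetical layout: uint32 id, float64 value, uint16 flags.
# "<" means little-endian with no padding, so every record is 14 bytes.
RECORD = struct.Struct("<IdH")


def pack_records(rows) -> bytes:
    """Serialize rows back to back with no delimiters."""
    return b"".join(RECORD.pack(*row) for row in rows)


def read_record(buf: bytes, index: int) -> tuple:
    """Jump straight to record `index` without parsing the earlier records."""
    return RECORD.unpack_from(buf, index * RECORD.size)


data = pack_records([(1, 0.5, 0), (2, 1.5, 1), (3, 2.5, 0)])
third = read_record(data, 2)
```

The same arithmetic works on a memory-mapped file, whereas a text format needs a scan for delimiters to find the third row.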

Also, the claim about being able to modify text formats is completely wrong. Nothing can be changed if we introduce checksums inside text formats. Also if we digitally sign a format, then nothing can be changed despite the fact that it's a text format.

Also, comparing CSV files to internal database binary format? It's like comparing a book cover to the ERP system of a library. Meaning, it's comparing two completely different things.

paulddraper · 34m ago

You want JARs to be human-readable? PNGs? MP3s?

I think the author is thinking about a very narrow set of files.

ape4 · 1h ago

Let's hear it for RTF for documents

rickcarlino · 4h ago

Do you have the Gemini:// URL? I’m getting a URL resolution error.

rizky05 · 3h ago

gemini://adele.pollux.casa/gemlog/2025-08-04_why_I_prefer_human-readble_file_formats.gmi
