An almost catastrophic OpenZFS bug and the humans that made it

48 r4um 59 7/11/2025, 6:08:30 AM despairlabs.com ↗

Comments (59)

Cthulhu_ · 1d ago

Wouldn't this have been caught by an exhaustive unit test or even a fuzz test? I don't know what kind of testing practices are applied to projects like zfs, nor what kinds or amounts of tests functions like this are subject to, but I would imagine that especially for low-level functions like this, unit tests with all of the known edge cases would be written.

(yes this is very much an armchair opinion, I mostly do front-end JS for a living)

bspammer · 23h ago

Yeah that surprised me too - I would have assumed that ZFS had a bunch of "store and retrieve" tests in many different configurations that would have caught this.

4gotunameagain · 22h ago

Does it even need an exhaustive unit test?

A single call where the psize is different than the asize would have caught it.

jpfr · 1d ago

The problems with C are real.

At the same time, the tooling has gotten much better in the last years.

Clang-analyzer is fast enough to run as part of the CI. Newer gcc also give quite a few more warnings for unused results.

My recommendation to the project is to

- Remove all compiler warnings and enable warning-as-error

- Increase the coverage of unit tests to >80%

That is a lot of work. But that's what is required for high-criticality systems engineering.

rollcat · 19h ago

I hate to be the that OpenBSD guy, but "the people who do the work are the ones to decide how it's done". Yes, people are paid to maintain OpenZFS, but so far nobody is ready to pay for (or volunteer to) meet your bar.

Side note: OpenZFS already has an extensive test suite. Merely hitting a code branch wouldn't have caught this one.

Nokinside · 23h ago

Software verification tools based on abstract Interpretation are really good today.

If you want free software I recommend IKOS - a is a sound static analyzer for C/C++ developed at NASA. Checks: https://github.com/NASA-SW-VnV/ikos/blob/master/analyzer/REA... Numerical abstract domains: https://github.com/NASA-SW-VnV/ikos/blob/master/analyzer/REA...

Commercial tool like Astree https://www.absint.com/astree/index.htm if you have money.

topspin · 1d ago

I found it. All that tells you is that it's a simple problem. Had I not been told it was broken I likely would not have.

It's the kind of bug that makes you stop breathing for a brief moment. So I ran this function through Gemini 2.5 Pro, ChatGPT o3 and Grok 3. No context other than the source code fragment was provided. All three also clearly identified the problem.

There is no longer a reason for trivial flaws such as this surviving to release code. We're past that now, and that's an astonishing thing.

These are the questions I ponder: Should we consider the possibility that the ongoing, incomplete and decades long pursuit of "better" languages is now misguided? Or, that perhaps "better" might now mean whatever properties make code easier for LLMs to analyze and validate against specifications, as opposed to merely preventing the flaws humans are prone to make?

piker · 23h ago

No way is it practical to run complex code like this through a sycophantic parrot. Try it again with some old, well known code and see how many “Certainly! Your error is …” you get.

perching_aix · 22h ago

There was an article here about a month or so ago doing exactly that unpractical thing against I think NFS, and finding a small handful of confirmed vulnerabilities that way. Wish granted I suppose?

piker · 22h ago

Fine, but TFA suggests that even -Wunused is too impractical. Taking that at face value, it's not clear how "Hey use this thing that has an even higher latency and lower signal-to-noise ratio" is a solution.

perching_aix · 22h ago

Well yeah, you're definitely not going to hear me suggest sending a codebase through LLMs sooner than suggest promoting warnings into errors or using well established tools, that's for sure :p

spacecadet · 22h ago

Doesnt make one much more than a sycophantic parrot when they also repeat platitudes without anything else to add.

kalaksi · 23h ago

> whatever properties make code easier for LLMs to analyze

So double down on languages that are more represented in training data? I think it's still worthwhile to actually improve programming languages. Relying only on LLMs is fragile. So ideally do both: improve language and AI.

topspin · 23h ago

> So double down on languages that are more represented in training data?

The pragmatic answer to that is: this appears to be highly effective.

What I have in mind is something else, however. Consider the safety nets we try to build with, for example, elaborate type systems. These new analysis tools can operate on enormous contexts and infer and enforce "type" with a capacity far greater what a human mind can hope to approach. So perhaps there is little value to complex types. Instead, a simple type system supported by specification will be the superior approach. seL4 is an example of the concept: C, but exhaustively specified and verified.

kccqzy · 20h ago

In my experience good type systems help human programmers and AI agents alike. AI can only infer "type" probabilistically and haphazardly. It is far from able to enforce anything. A good type system helps both humans and AI agents by correcting their mistakes sooner.

kccqzy · 20h ago

The pursuit of "better" languages might be incomplete, but it has long been known that this specific problem can be solved by the newtype pattern. It might have originated in Haskell but it's equally applicable in more traditional languages like C++. All you need is to define a named struct containing an integer called PhysicalSize and another one called AllocatedSize. In C++ add operator overloading to it so you can do arithmetic. In Haskell add instances such as Integral and Num.

0points · 22h ago

> There is no longer a reason for trivial flaws such as this surviving to release code.

This is unrelated to "no longer due to ChatGPT times".

We've been able to detect this issue for decades already, see -Wunused-variable.

topspin · 22h ago

But:

    DIV_ROUND_UP(psize, cols);

I need sleep, so I'm not going to check, but that may appear to be an intermediate value to a C compiler, and thus escape -Wunused-variable.

aaronmdjones · 19h ago

> that may appear to be an intermediate value to a C compiler, and thus escape -Wunused-variable.

It does indeed (I checked):

    $ cat test.c
    
    #include <stdint.h>
    
    #define     DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))
    
    struct vdev;
    typedef struct vdev vdev_t;
    
    struct vdev
    {
        uint64_t     vdev_ashift;
        void        *vdev_tsd;
        vdev_t      *vdev_top;
    };
    
    struct vdev_raidz;
    typedef struct vdev_raidz vdev_raidz_t;
    
    struct vdev_raidz
    {
        int          vd_original_width;
        int          vd_physical_width;
        int          vd_nparity;
    };
    
    extern uint64_t vdev_raidz_get_logical_width(vdev_raidz_t *, uint64_t);
    
    static uint64_t
    vdev_raidz_asize_to_psize(vdev_t *vd, uint64_t asize, uint64_t txg)
    {
        vdev_raidz_t *vdrz = vd->vdev_tsd;
        uint64_t psize;
        uint64_t ashift = vd->vdev_top->vdev_ashift;
        uint64_t cols = vdrz->vd_original_width;
        uint64_t nparity = vdrz->vd_nparity;
    
        cols = vdev_raidz_get_logical_width(vdrz, txg);
    
        psize = (asize >> ashift);
        psize -= nparity * DIV_ROUND_UP(psize, cols);
        psize <<= ashift;
    
        return (asize);
    }



    $ gcc-14 -Wall -Wextra -Wpedantic -Wunused -Wunused-variable -Wunused-but-set-variable -c test.c -o test.o
    
    test.c:28:5: warning: ‘vdev_raidz_asize_to_psize’ defined but not used [-Wunused-function]
       28 |     vdev_raidz_asize_to_psize(vdev_t *vd, uint64_t asize, uint64_t txg)
          |     ^~~~~~~~~~~~~~~~~~~~~~~~~



    $ clang-19 -Weverything -c test.c -o test.o
    
    test.c:33:31: warning: implicit conversion changes signedness: 'int' to 'uint64_t' (aka 'unsigned long') [-Wsign-conversion]
       33 |         uint64_t cols = vdrz->vd_original_width;
          |                  ~~~~   ~~~~~~^~~~~~~~~~~~~~~~~
    
    test.c:34:34: warning: implicit conversion changes signedness: 'int' to 'uint64_t' (aka 'unsigned long') [-Wsign-conversion]
       34 |         uint64_t nparity = vdrz->vd_nparity;
          |                  ~~~~~~~   ~~~~~~^~~~~~~~~~
    
    test.c:28:5: warning: unused function 'vdev_raidz_asize_to_psize' [-Wunused-function]
       28 |     vdev_raidz_asize_to_psize(vdev_t *vd, uint64_t asize, uint64_t txg)
          |     ^~~~~~~~~~~~~~~~~~~~~~~~~
    
    3 warnings generated.

(The unused-function diagnostic is expected for this test)

I would have also expected clang to catch the double assignment to cols, to be honest.

0points · 17h ago

Oh, that's unfortunate :-(

I don't know these compiler internals, but how come the use of DIV_ROUND_UP() make the compilers ignore the unused state of the end variable?

I mean, after the DIV_ROUND_UP(), theres a bit shift:

    psize <<= ashift;

Only after the bit shift, nobody uses the result.

Shouldn't this be enough clues for the compiler to realize the variable is unused?

aaronmdjones · 10h ago

> Shouldn't this be enough clues for the compiler to realize the variable is unused?

You would think so! I can't imagine why it behaves this way.

topspin · 8h ago

> I can't imagine why it behaves this way.

The warning implementation must be simple minded: if the variable is ever read then it isn't "unused." The compiler need only set a flag on every read, and later scan all the flags. What you wish it did is track the lexical position of the last assignment and detect when pointless assignment occurs. This is clearly more complex and also has a higher performance cost. So for reasons of laziness and/or performance, the developers just haven't bothered.

That's what I imagine. Having dealt with C compilers for a long time now, it doesn't occur to me to expect better, so I guessed correctly that Wunused was not applicable here.

topspin · 15h ago

> (I checked)

Thank you.

dwroberts · 23h ago

> I found it. All that tells you is that it's a simple problem

Not totally clear what you mean here - are you saying you’re the author of the article or PR, or that you independently discovered the same issue (after the fact)?

topspin · 23h ago

Ok, so somehow that is causing confusion. I will clarify.

The author asked that the reader attempt to find the flaw by inspecting the provided code. The flaw was obvious enough that I was able to find it. The implication is that if it were less obvious, I might not have. I was not attempting to take any credit at all: exactly the opposite.

rini17 · 23h ago

Don't forget author did 99% of the work for you by finding that function.

topspin · 23h ago

Having wrote "Had I not been told it was broken I likely would not have," should make it clear that this wasn't lost on me.

gf000 · 22h ago

I mean, once you pinpoint 5 lines of code and say that there is an error, it is not particularly hard to find it.

Now give the whole repo before the fix to an LLM and ask it to find this bug with some context the author started with. I'm sure you can fix every software issue that way as well!

spacecadet · 22h ago

Like gibberlink. Programming languages are insane to me. In a way we created them to communicate with machines, but they evolved to complement a very human workflow. It makes little sense outside of assisting humans to train LLMs on human computer languages, why not let the AIs generate their own optimized languages?

Roark66 · 19h ago

>It makes little sense outside of assisting humans to train LLMs on human computer languages, why not let the AIs generate their own optimized languages?

They produce so much BS, but at least now we can catch most of it. If we the languages are suddenly optimised for the AIs we couldn't do that anymore.

topspin · 22h ago

> why not let the AIs generate their own optimized languages?

Excellent question. What would it end up looking like? Would it be full of monads and HKTs? Lambda calculus? WebAssembly?

I hope to learn the answer one day.

fredoralive · 1d ago

I missed it, but I was distracted by cols variable being initialised with the original width, but then being immediately overwritten with the logical width.

prmoustache · 23h ago

Isn't it a case for eliminating all warnings and treat them as bugs?

It seems you are dooming your project the minute you start ignoring your first warning.

bloak · 23h ago

Most projects I've worked on treat warnings as bugs. It's annoying, sometimes, when you have to fiddle with the code and add lines to prevent bogus warnings from breaking the build, particularly when you're making a temporary change for debugging purposes, and I remember a couple of occasions when we had to insert a comment something like "This was needed to prevent a warning caused by a bug in GCC X.Y.Z (link to compiler bug on issue tracker)". But, on balance, it's worth it.

0points · 22h ago

Unexpected issue (for me).

I would have expected that the compiler should complain that psize is computed but unused.

Why isn't -Wunused-variable enabled in OpenZFS?

pja · 22h ago

The blog post suggests that, probably because it’s never been used, it’s just too noisy to turn on.

Arguably every unused variable in the code base is a potential bug like this waiting to chomp on some poor user’s data though, so arguably they should be turning it on & dealing with the consequences?

0points · 22h ago

> so arguably they should be turning it on & dealing with the consequences?

Most definitely, yes.

0points · 16h ago

FWIW: I learned in a sibling comment, that this compiler flag doesn't produce a warning in this specific case anyway (but I think it should!).

In general (for any programming language, really), I would advocate for enabling warnings and address them as you go along.

Because that unlocks the developer feature of being able to react when code changes introduces warnings, as they generally point to unsound code.

apple1417 · 1d ago

Having run into similar problems several times, I've also never really been satisfied with the solutions. You have to have different types to cause compile errors, but then you have throw those types away whenever you perform any actual operations on them.

    struct TypeA { int inner; };
    struct TypeB { int inner; };

    struct TypeA a1, a2;
    // whoops
    TypeB result = {.inner = a1.inner + a2.inner};

Don't get me wrong, where I've used this sort of pattern it's definitely caught issues, and it definitely massively lowers the surface area for this type of bug. But I still feel like it's quite fragile.

The use case I felt worked best was seperating out UTC datetime vs localised datetime structs - mainly since it's something you already had to use function calls to add/subtract from, you couldn't play with raw values.

flohofwoe · 1d ago

I'm not sure why C is blamed in this case when you can do exactly the same strong typing fix in C, and with C99 struct literals it's also not much worse to work with:

    typedef struct { size_t size; } PhysicalSize;
    typedef struct { size_t size; } AllocatedSize;

    PhysicalSize psize = { 123 }; // or { .size = 123 }
    AllocatedSize asize = { 234 };

    psize = asize; // error

...and in reverse, Rust wouldn't protect you from that exact same bug if the programmer decided to use usize like in the original C code.

IME overly strong typing is a double-edged sword though, on one hand it makes the code more robust, but on the other hand also more 'change resistant'.

I still would like to see a strong typedef variant in C so that the struct wrapping wouldn't be needed, e.g.:

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3320.htm

beeb · 23h ago

Rust would warn you of an unused variable: "warning: value assigned to `psize` is never read"

bloak · 23h ago

And so would GCC: warning: variable psize set but not used [-Wunused-but-set-variable]

beeb · 23h ago

The fact that the bug slipped through the cracks highlights the importance of sane defaults.

flohofwoe · 22h ago

The warning is in the -Wall warning set, which tbh should be the bare minimum each C/C++ project enables (along with -Wextra, and -Werror).

a_t48 · 1d ago

FWIW you can do the same thing in cpp, too - but Rust’s syntax certainly makes it easier.

Voultapher · 20h ago

I'm surprised their testing didn't catch it. And while I get that turning on warnings in a code-base not designed with them can be annoying, `-Wunused` is certainly within reach. I've enabled and maintained `-Weverything` in a 30m LoC C code base that is older than myself. OpenZFS should try harder to enable such warnings in their CI.

rollcat · 19h ago

That's a bit unfair. The size and age of the codebase speaks nothing of its quality. I've thrown out 50k 5yro Python code without a second thought.

Also, filesystems are an insanely complex problem, and ZFS is by far the most complex one to be widely deployed.

Narew · 1d ago

It remind me a devlog from andrew kelley on the same topics : https://ziglang.org/devlog/2024/#2024-11-04

giingyui · 19h ago

How come in that function the local variable cols is assigned a value and then immediately it’s assigned another value without the original value being read?

Uriopass · 23h ago

In this case any simple unit test would have catched it. Surprised it wasn't mentioned in the post.

atiq-ca · 1d ago

Thanks contribution to openzfs.

v5v3 · 22h ago

How much would it cost/how long to give an LLM the entire OpenZfs repo and ask it to find every bug?

v5v3 · 22h ago

To the person who downvoted this...

accept the world has changed my friend!!

And more so, as the NSA, China, Russia etc will be using llms to comb repo's to find zero day vulnerabilities.

jmclnx · 18h ago

For my personal use, I have no need for ZFS on Linux (ext4) or *BSD (UFS/ffs).

I have been playing with NetBSD for a bit and I discovered their FS has a log parameter, using that on mount to me makes their file system almost as good as ext4. Interesting to me it is not advertised too much.

v5v3 · 22h ago

Gpt4o-mini response:

>The provided code snippet is a function that converts an allocation size (asize) into the largest physical size (psize) that can be safely written to an allocation for a given virtual device (vdev). Here are some observations and potential issues in the code:

Return Value: The function currently returns asize instead of psize. This seems to be a mistake, as the purpose of the function is to compute and return the psize. The correct return statement should be:

    return psize;

Comment Clarity: The comments mention that the function does not take into account the effect of gang headers. It might be helpful to clarify what "gang headers" are and how they affect the allocation sizes, especially for someone unfamiliar with the context.

Variable Initialization: The variable psize is calculated based on asize, ashift, cols, and nparity. Ensure that these values are correctly initialized and that vdev_raidz_get_logical_width returns a valid value for cols. If cols is zero, it could lead to division by zero in the DIV_ROUND_UP macro.

Assertions: The assertion ASSERT0(asize % (1 << ashift)); checks that asize is a multiple of 2^ashift. If this assertion fails, it will cause the program to terminate. Ensure that this condition is always met before calling this function, or consider handling this case more gracefully.

Macro Usage: The DIV_ROUND_UP macro should be checked to ensure it handles edge cases correctly, such as when psize is zero. If psize is zero, the division could lead to unexpected behavior.

Code Readability: Consider adding more comments to explain the logic behind the calculations, especially for someone who may not be familiar with the RAID-Z implementation.

Error Handling: There is no error handling in the function. Depending on the context in which this function is used, it might be beneficial to add checks for invalid input values and return an error code or handle exceptions accordingly.

Here is a revised version of the function with the return value corrected:

.....(Snipped for formatting issues)

Make sure to test the function thoroughly to ensure it behaves as expected under various conditions.

nottorp · 23h ago

So why is the solution Rust, and not any of the other 2000 "modern" languages?

desdenova · 23h ago

So give me a list of the 2000 languages OpenZFS could've been written in.

mrlonglong · 13h ago

Cobol /s

Ask HN: To what extend have you stopped or limited your use of AI?

Tell HN: uBlock Origin on Chrome is finally gone

Ask HN: What do you do with your list of articles links?

Ask HN: How did Soham Parekh get so many jobs?

Should LLMs ask "Is this real or fiction?" before replying to suicidal thoughts?

Co-founder exiting after pivot – what's a fair exit package?

Tell HN: I Lost Joy of Programming

Ask HN: How to Create Data Flow Diagrams?

Ask HN: Good non tech companies to work at

Ask HN: What techniques do you use to remember complex concepts and theories?

The computational cost of corporate rebranding

Helpful function to find memory leaks in JavaScript

Fuel switches cut off before Air India crash that killed 260

Ask HN: Bug Bounty Dilemma – Take the $$ and Sign an NDA or Go Public?

Ask HN: People who work different timezones than your company. How sched?

Ask HN: How can I invest in Solar Power?

I built a AI transcription app because my girlfriend needed one for uni

Ask HN: Does your on-call rotation suck? Can I join it?

Ask HN: Worth leaving position over push to adopt vibe coding?

Google fails to dismiss wiretapping claims on SJ, settles with app users

Ask HN: What's the verdict on GPT wrapper companies these days?

Ask HN: Any resources for finding non-smart appliances?

Ask HN: What are some cool or underrated tech companies based in Australia?

Ask HN: Advice for Starting a Hacker Space?

Ask HN: New RevOps guy wants to switch us from M365 to GSuite+Slack

Ask HN: What's Your Experience with Vibe Coding?

What's your experience using Lynx (mobile framework)?

Ask HN: What inspires you to persevere through adversity?

Ask HN: Has anyone else learned English just by reading tech posts (like HN)?

An almost catastrophic OpenZFS bug and the humans that made it

Comments (59)