Kent is in the wrong. If I held a lead position in development, I would kick Kent off the team.
It's one thing to challenge decisions. What Kent is doing is something completely different: it is obvious he introduced a feature, not just a bugfix.
If the rules say that rc1 and later get only bugfixes, then it is absolutely clear what happens to a feature. Tolerating this once or twice is ok, but Kent is doing this all the time, testing Linus.
Linus is absolutely within his rights to kick this out, and it's Kent's fault if he does so.
pmarreck · 9h ago
This can happen with prima donna devs who haven't had to collaborate in a team environment for a long time.
It's a damn shame too because bcachefs has some unique features/potential
bgwalter · 8h ago
bcachefs is experimental and Kent writes in the LWN comments that nothing would get done if he didn't develop it this way. Filesystems are a massive undertaking and you can have all the rules you want. It doesn't help if nothing gets developed.
It would be interesting to know how strictly the rules are applied to other people in the Linux kernel. Other projects have nepotistic structures where some developers can do what they want but others cannot.
Anyway, if Linus had developed the kernel with this kind of strictness from the beginning, maybe it wouldn't have taken off. I don't see why experimental features should follow the rules for stable features.
yjftsjthsd-h · 8h ago
If it's an experimental feature, then why not let changes go into the next version?
bgwalter · 7h ago
That is a valid objection, but I still think that for some huge and difficult features the month-long pauses imposed by release cycles are absolutely detrimental.
Ideally they'd be developed outside the kernel until they are perfect, but Kent addresses this in his LWN comment: There is no funding/time to make that ideal scenario possible.
jethro_tell · 7h ago
He could release a patch that can be pulled by the people that need it.
If you’re using experimental file systems, I’d expect you to be pretty competent in being able to hold your own in a storage emergency, like compiling a kernel if that’s the way out.
This is a made-up emergency, used to break the rules.
Analemma_ · 6h ago
This position seems so incoherent. If it’s so experimental, why is it in the mainline kernel? And why are fixes so critical they can’t wait for a merge window? Who is using an “experimental” filesystem for mission-critical work that also has to be on untested bleeding-edge code?
Like the sibling commenter, I suspect the word “experimental” is being used here to try and evade the rules that, somehow, every other part of the kernel manages to do just fine with.
dataflow · 37m ago
> evade the rules that, somehow, every other part of the kernel manages to do just fine with
I have no context on the situation or understanding of what the right set of rules here is, but the difference between filesystems and other code is that bugs in the filesystem can cause silent, persistent corruption for user data, on top of all the other failure modes. Most other parts of the kernel don't have such a large persistence capability in case of failure. So I can understand if filesystems feel the need to be special.
koverstreet · 6h ago
No, you have to understand that filesystems are massive (decade+) projects, and one of the key things you have to do with anything that big that has to work that perfectly is a very gradual rollout, starting with the more risk tolerant users and gradually increasing to a wider and wider set of users.
We're very far along in that process now, but it's still marked as experimental because it is not quite ready for widespread deployment by just anyone. 6.16 is getting damn close, though.
That means a lot of our users now are people getting it from distro kernels, who often have never compiled a kernel before - nevertheless, they can and do report bugs.
And no matter where you are in the rollout, when you get bug reports you have to fix them and get the fixes out to users in a timely manner so that they can keep running, keep testing and reporting bugs.
It's a big loss if a user has to wait 3 months for a bugfix - they'll get frustrated and leave, and a big part of what I do is building a community that knows how the system works, how to help debug, and how to report those bugs.
A very common refrain I get is "it's experimental <expletive deleted>, why does it matter?" - and, well, the answer is getting fixes out in a timely manner matters just as much if not more if we want to get this thing done in a timely manner.
hamandcheese · 2h ago
I am sympathetic to your plight. I work on internal dev tools, and being able to get changes out to users quickly is an incredible tool for serving them well and earning (or keeping) their trust.
It seems like you want that kind of fast turn around time in the kernel though, which strikes me as an impossible goal.
orbisvicis · 5h ago
Isn't this the point of DKMS, to decouple module code from kernel code?
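(For context, DKMS rebuilds an out-of-tree module against each installed kernel; packaging is basically just a dkms.conf next to the module source. A minimal sketch follows - the name, version and paths are hypothetical, not an actual bcachefs package:)

    # dkms.conf - minimal sketch, names and version are hypothetical
    PACKAGE_NAME="bcachefs"
    PACKAGE_VERSION="1.0"
    BUILT_MODULE_NAME[0]="bcachefs"
    DEST_MODULE_LOCATION[0]="/kernel/fs/bcachefs/"
    AUTOINSTALL="yes"   # rebuild automatically when a new kernel is installed
    # MAKE/CLEAN lines omitted for brevity

    # typical usage after dropping the source into /usr/src/bcachefs-1.0/
    dkms add -m bcachefs -v 1.0
    dkms install -m bcachefs -v 1.0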
koverstreet · 5h ago
Well, my hope when bcachefs was merged was for it to be a real kernel community project.
At the time it looked like that could happen - there was real interest from Red Hat prior to merging. Sadly Red Hat's involvement never translated into much code, and while upstreaming did get me a large influx of users - many of whom have helped enormously with the QA and stabilization effort - the drama and controversies have kept developers away, so on the whole it's meant more work, pressure and stress for me.
So DKMS wouldn't be the worst route, at this point. It would be a real shame though, this close to taking the experimental label off, and an enormous hassle for users and distributions.
But it's not my call to make, of course. I just write code...
webstrand · 3h ago
DKMS is an awful user experience; it's an easy way to render a system unbootable. I hope Linus doesn't force existing users, like me, down that path. It's why I avoid ZFS, however good it may be.
alphazard · 5h ago
> Having a lead position in development I would kick Kent of the team.
I've seen this sentiment a lot lately. That disagreeable top performers have to be disposed of because they are "toxic" or "problematic".
You aren't doing your job as a leader if this is your attitude to good engineers. Engineering is a field where a small amount of the people create a large amount of the value. You can either understand that, and take it upon yourself to integrate disagreeable yet high performing people into the team, paving over the rough patches yourself. Or you can oust them, and quite literally take a >50% productivity hit on your team.
A disagreeable person will take up more of your time as a manager, but a high performer is worth significantly more of your time. When these traits co-occur in the same person, the cost-benefit is complicated. The reason we talk about this problem a lot in tech is because it is legitimately a tough call, with errors in both directions. Wishing that the right move was always as simple as kicking someone off the team doesn't make it true, although it may relieve you from having to contend with the decision.
viraptor · 5h ago
> Or you can oust them, and quite literally take a >50% productivity hit on your team.
In the short term, possibly. But do you think bcachefs is better off in the current situation than if it moved at half the speed, but without conflict? By being out of the kernel it will get less testing and fewer contributions, the main developer will waste time rebasing the patch set with every release, and distros are unlikely to expose bcachefs to the user any time soon. When you're working with an ecosystem / teams, a single person's performance really doesn't mean that much in the end. And occasionally Kent will still have to upstream some changes to interfaces - how likely is anyone to review/merge them quickly?
And now, what are the chances this will ever become more than a single person project really?
alphazard · 4h ago
It would be worse for bcachefs and the kernel if they parted ways. The Linux kernel does not have a feature complete alternative to APFS. Apple, of all companies, is beating the Linux kernel at filesystems. That hasn't happened before.
> When you're working with an ecosystem / teams, single person's performance really doesn't mean that much in the end.
This is demonstrably not true. Kent brought Bcachefs to fruition and got it upstreamed. Wireguard was also one guy. The cryptography used in both, also 1 guy. There's an argument to be made that given an elegant, well designed system, we should assume it came from a single or a few minds. But given a system that's been around for a while, you would be right to assume that a lot of people were/are involved in keeping it around.
koverstreet · 5h ago
It's not one or the other.
Ideally, you teach people how to get along better together; I think of my job as manager (and I effectively do manage a large team these days) as one of teaching and fostering good communication.
Pet_Ant · 11h ago
Why take it out of the kernel? Why not just make someone responsible the maintainer so they can say "no, next release" to his shenanigans? It can't be the license.
tliltocatl · 10h ago
Maintainers aren't getting paid and so cannot be "appointed". Someone must volunteer - and most people qualified and motivated enough are already doing something else.
timewizard · 10h ago
Presumably there would be an open call where people would nominate themselves for consideration. These are problems that have come up and been solved in human organizations for hundreds of years before the kernel even existed.
xorcist · 8h ago
There is no call. Anyone can volunteer at any time.
Software takes up no space and there is no scarcity. Theoretically there could be any number of maintainers, and whatever gets uptake is the de facto upstream. That's what people refer to when they talk about free software development in terms of meritocracy.
timewizard · 3h ago
How would they know to volunteer? Are you saying I can perform a hostile volunteering to take over for a maintainer who does not want to give up the project? I don't think you understood what was meant.
criticalfault · 11h ago
This is unclear to me as well, but I'm saying I wouldn't hold it against Linus if he did this. And based on Kent's behavior, he has every right to do so.
A way to handle this would be with one person (or more) in between Kent and Linus. And maybe a separate tree only for changes and fixes from bcachefs that those people in between would forward to Linus. A staging of sorts.
nolist_policy · 11h ago
Kent can appoint a suitable maintainer if he wishes. That's his job, not Linus'.
monkeyelite · 5h ago
What you're saying makes sense, but do we know the social convention for bugfix classification in Linux?
My job matches what you're describing, but "bug fix" is interpreted broadly. It basically means "managers don't do anything stupid".
If someone got in trouble and the language used was what you're describing - "the rules are clear and were broken" - I would feel they were being singled out.
redeeman · 6h ago
There have been several examples of other exceptions. We are talking data corruption here. Kent may not be the best communicator, but he cares about what matters. You'd rather see people lose their data than bend the rules.
ajb · 10h ago
Yeah.. the thing is, suppose Kent was 100% right that this needed to be merged in a bugfix phase, even though it's not a bug fix. It's still a massive trust issue that he didn't flag that the contents of his PR were well outside what was expected.
That means Linus has to check each of his PRs assuming that it might be pushing the boundaries without warning.
No amount of post hoc justification gets you that trust back, not when this has happened multiple times now.
NewJazz · 10h ago
He mentioned it in his PR summary as a new option. About half of the summary of the original PR was talking about the new option and why it was important.
I'm not saying he made a PR just saying "Fixes" like a rookie. What I'm saying is that there should have been something in there along the lines of "heads up - I know this doesn't comply with the usual process for the following commits, here's why I think they should be given a waiver under these circumstances", followed by the justifications that appeared after Linus got upset.
The PR description would have been fine - if it had been in the right stage of the process.
dralley · 12h ago
I donate to Kent's patreon and I'm very enthusiastic about bcachefs.
However, Kent, if you read this: please just settle down and follow the rules. Quit deliberately antagonizing Linus. The constant drama is incredibly offputting. Don't jeopardize the entire future of bcachefs over the silliest and most temporary concerns.
If you absolutely must argue about some rule or other, then make that argument without having your opening move be to blatantly violate them and then complain when people call you out.
You were the one who wanted into the kernel despite many suggestions that it was too early. That comes with tradeoffs. You need to figure out how to live with that, at least for a year or two. Stop making your self-imposed problems everyone else's problems.
thrtythreeforty · 9h ago
I did subscribe to his Patreon but I stopped because of this - vote with your wallet and all that. I would happily resubscribe if he can demonstrate he can work within the Linux development process. This isn't the first time this flavor of personality clash has come up.
Kent is absolutely technically capable of, and has the vision to, finally displace ext4, xfs, and zfs with a new filesystem that Does Not Lose Data. To jeopardize that by refusing to work within the well-established structure is madness.
NewJazz · 10h ago
Seriously, how hard is it to say "I'm unhappy users won't have access to this data recovery option, but I will postpone its inclusion until the next merge window"? Yeah, maybe it sucks for users who want the new option or what have you, but like you said, it is a temporary concern.
vbezhenar · 9h ago
Why does it suck for users? Those brave enough to use a new filesystem can surely use a custom kernel for the time being, while the merge effort is underway and the vanilla kernel might not be the most stable option.
mjevans · 6h ago
I think the others have been proven correct. It _is_ too early. It would have been better maintained in one of the other staging branches to brew and also possibly as a patch-set that could be added atop the vanilla branch.
chasil · 12h ago
So the assertion is that users with (critical) data loss bugs need complete solutions for recovery and damage containment with all possible speed, and without this "last mile" effort, stability will never be achieved.
The objection is that the tiny bug-fix windows get everything but the kitchen sink.
These are both uncomfortable positions to occupy, without doubt.
koverstreet · 10h ago
No, the assertion is that the proper response to a bug often (and if it's high impact - always) involves a lot more than just the bugfix.
And the whole reason for a filesystem's existence is to store and maintain your data, so if that is what the patch is for, yes, it should be under consideration as a hotfix.
There's also the broader context: it's a major problem for stabilization if we can't properly support the people using it so they can keep testing.
More context: the kernel as a whole is based on fixed timetables and code review, which it needs because QA (especially automated testing) is extremely spotty. bcachefs's QA, both automated testing and community testing, is extremely good, and we've had bugfix patchsets either get held up or turn into flamewars because of this mismatch entirely too many times.
magicalhippo · 8h ago
While I absolutely think you're picking the wrong fights - I don't see why you needed to push so far on this hill in particular - I am sympathetic to your argument that experimental kernel modules like filesystems might need a different release approach at times.
At work we have our main application which also contains a lot of customer integrations. Our policy has been new features in trunk only, except if it's entirely contained inside a customer-specific integration module.
We do try to avoid it, but this does allow us to be flexible with regards to customer needs, while keeping the base application stable.
This new recovery feature was, as far as I could see, entirely contained within the bcachefs kernel code. Given the experimental status, as long as it was clearly communicated to users, I don't see a huge problem allowing such self-contained features during the RC phase.
Obviously a requirement must be that it doesn't break the build.
WesolyKubeczek · 9h ago
> No, the assertion is that the proper response to a bug often (and if it's high impact - always) involves a lot more than just the bugfix.
Then what you do is you try to split your work in two. You could think of a stopgap measure or a workaround which is small, can be reviewed easily, and will reduce the impact of the bug while not being a "proper" fix, and prepare the "properer" fix when the merge window opens.
I would ask, since the bug has probably existed since the last stable release, how come it fell through the cracks and was only noticed recently? Could it be that not all setups are affected? If so, can't they live with it until the next merge window?
By making a "feature that fixes the bug for real", you greatly expand the area in which new, unknown bugs may land, with very little time to give it proper testing. This is inevitable, as evidenced by the simple fact that the bug you were trying to fix exists. You can be good, but not that good. Nobody is that good. If anybody was that good, they wouldn't have the bug in the first place.
If you have commercial clients who use your filesystem and you have contractual obligations to fix their bugs and keep their data intact, you could (I'd even say "should") maintain an out-of-tree version with its own release and bugfix schedule. This is IMO the only reasonable way to have it, because the kernel is a huge administrative machine with lots of people, and by mainlining stuff, you necessarily become co-dependent on the release schedule for the whole kernel. I think a conflict between kernel's release schedule and contractual obligations, if you have any, is only a matter of time.
koverstreet · 6h ago
> Then what you do is you try to split your work in two. You could think of a stopgap measure or a workaround which is small, can be reviewed easily, and will reduce the impact of the bug while not being a "proper" fix, and prepare the "properer" fix when the merge window opens.
That is indeed what I normally do. For example, 6.14 and 6.15 had people discovering btree iterator locking bugs (manifesting as assertion pops) while running evacuates on large filesystems (it's hard to test a sufficiently deep tree depth in virtual machine tests with our large btree nodes); some small hotfixes went out in rc kernels, but the majority of the work (a whole project to add assertions for path->should_be_locked, which should shut these down for good) waited until the 6.16 merge window.
That was for a less critical bug - your machine crashing is somewhat less severe than losing a filesystem.
In this case, we had a bug pop up in 6.15 where the link count in the VFS inode got screwed up, causing an inode that shouldn't have been deleted - a subvolume root - to be deleted, and then an untested repair path took out the entire subvolume.
Ouuuuch.
That's why the repair code was rushed; it had already gotten one filesystem back, and I'd just gotten another report of someone else hitting it - and for every bug report there are almost always more people who hit it and don't report it.
And considering that a lot of people running bcachefs now are getting it from distro kernels and don't know how to build kernels - that is why it was important to get this out quickly through the normal channels.
In addition, the patch wasn't risky, contrary to what Ted was saying. It's a code path that's very well covered by automated tests, including KASAN/UBSAN/lockdep variants - those would have exploded if this patch were incorrect.
When to ship a patch is always a judgement call, and part of how you make that call is how well your QA process can guarantee the patch is correct. Part of what was going on here is a disconnect between those of us who do make heavy use of modern QA infrastructure and those who do it the old school way, relying heavily on manual review and long testing periods for rc kernels.
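(For readers unfamiliar with those terms: the "variants" are just the same test suite run on kernels built with the usual debug instrumentation enabled - roughly the sketch below, using the kernel's own scripts/config helper. The exact option set here is illustrative, not an actual CI configuration.)

    # starting from an existing .config in the kernel tree
    ./scripts/config -e KASAN          # use-after-free / out-of-bounds detection
    ./scripts/config -e UBSAN          # undefined-behaviour detection
    ./scripts/config -e PROVE_LOCKING  # lockdep: lock ordering validation
    make olddefconfig                  # resolve newly-required dependencies
    make -j"$(nproc)"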
jethro_tell · 7h ago
Who’s using an experimental filesystem and risking critical data loss? Rule one of experimental file systems is have a copy on a not experimental file system.
alphazard · 6h ago
This whole debacle is the perfect advertisement for microkernels. The only reason Kent needs to coordinate with Linus is because filesystems need to live in the kernel. FUSE is second class. Imagine how much easier this all would be if Linux maintained a slowly evolving filesystem API, and all bcachefs had to do was keep up with it.
skissane · 3h ago
I don't think FUSE is deliberately a "second class citizen", it is simply that doing a filesystem in user space has a performance cost compared to doing it in the kernel - and that is a very tricky problem to solve. Even microkernels have this problem, it is just you don't notice it as readily because a pure microkernel doesn't offer in-kernel filesystems as a comparator - but if you take a microkernel and transform it into a hybrid kernel by moving filesystems (and block device drivers) into kernel space, like NeXT/Apple did in turning Mach into XNU, almost certainly you are going to see tangible performance gains. Maybe this is less true with more modern microkernel designs such as L4, but even there I suspect it is still true, even if not to quite the same extent.
I think the performance cost of FUSE compared to in-kernel filesystems is improving with time - FUSE with io_uring is a big step forward, but the immaturity of io_uring is an obstacle to its adoption, at least in the short-to-medium-term. I’m sure in the future we’ll see even further improvements in this area. But will we ever reach the Nirvana where FUSE equals the performance of in-kernel filesystems, or (maybe more realistically) the performance overhead has become so marginal nobody is bothered by it in practice? I’d like to think we eventually will, but it is far from certain.
koverstreet · 1h ago
There's no inherent reason why FUSE has to be noticeably slower for buffered IO; it just hasn't gotten nearly enough well-thought-out attention. But that's starting to change - there's a lot more interest these days in a faster FUSE.
Direct IO would be slower via FUSE, but L4 style IPC could solve that.
It would be an interesting proposition, although not my first choice for the direction I want to go in :)
skissane · 47m ago
I think the issue with any new physical filesystem is that even if it becomes mature, fully upstream as part of the mainline Linux kernel, and supported out-of-the-box by all the major distributions, a lot of people are still never going to use it, because there is so much competition in that space (ext4, XFS, btrfs, etc), people are understandably quite conservative (fear of data loss due to bugs), and there is the fear that a less popular filesystem may end up abandoned if something unexpected happens to its primary developer (see e.g. ReiserFS).
By contrast, improvements in the performance of FUSE, and L4-style IPC, could be much more widely beneficial - both for developers of new physical filesystems (by making user-space implementations viable, where they can iterate faster, get better API/ABI stability, and see easier adoption by end users), and for developers of numerous other pieces of software too.
Of course, you personally are going to scratch the itch you want to scratch. But in terms of what’s most beneficial for the Linux ecosystem as a whole, I think FUSE improvements and L4-style IPC would deliver the most benefit per unit of effort
A lot of open source volunteers can't really be replaced because there is no one willing to volunteer to maintain that thing. This is complicated by the fact that people mostly get credit for creating new projects and no credit for maintenance. Anyone who could take over bcachefs would probably be better off creating their own new filesystem.
heavensteeth · 3h ago
Whether or not you agree with Kent on this, you have to commend him for being very active in discussing issues with the community in a fairly open, calm, and thought-out way (at least from what I've seen).
Comparatively, I find subtweeting him from the sanctity of Mastodon, with a few insults and backhanded compliments thrown in for good measure, a bit low.
ajb · 8h ago
Ehh. I don't think Kent is an arsehole. The problem with terms like "arsehole" is that they conflate a bunch of different issues. It doesn't really have much explanatory power. Someone who is difficult to work with can be that way for loads of different reasons: ego, tunnel vision, stress, neurodivergence (of various kinds), commercial pressures, greed, etc.
There is always a point where you have to say "no, I can't work with this person any more", but while you are still trying to, it's worth trying to figure out why someone is behaving as they do.
skissane · 4h ago
> The problem with terms like "arsehole" that is that they conflate a bunch of different issues.
Agree, plus I'd add: if we are going to criticise other people's communication style/abilities or attitude, then using a vague, vulgar and hurtful slang term like "arsehole"/"asshole" (and similar slang such as "dick", "prick", etc) is an example of exhibiting the very thing one is complaining about, which is fundamentally hypocritical. One can state the same concerns in a more professional way, focusing on the details of the specific behaviour pattern rather than a vague term which can refer to lots of distinct behaviours (e.g. people with ASD traits who hurt the feelings of others because they honestly have trouble thinking about them, versus people with antisocial or narcissistic personality disorder traits who knowingly hurt the feelings of others because they enjoy doing so) - labelling the behaviour pattern not the person, and acknowledging that it may well be due to an unintentional skills gap, (sub)culture clash, differences in life experiences, neurodiversity/neurodivergence/mental health/trauma, etc.
I also think it is helpful when criticising the flaws of others to try to relate them to one's own, whenever possible - e.g. "sometimes in the past I did X, and from my perspective it looks like you are doing something similar". Hurtful labels do not encourage that kind of self-reflectiveness at all; they promote the idea that "I'm one of the good ones but you are one of the bad ones".
bgwalter · 4h ago
People who go on holier-than-thou rants like that are usually extremely unpleasant to work with and will cancel you (as directly admitted in that post) if you contradict them on anything.
jagged-chisel · 8h ago
For the uninitiated:
bCacheFS, not BCA Chefs. I’m not clued into the kernel at this level so I racked my brain a bit.
skissane · 3h ago
Given the context is an OS kernel, even if I hadn't heard of "bcachefs" before (which I have), parsing it as "bcache-fs" seems obvious.
By contrast, I don't know what "BCA Chefs" is supposed to be. "BCA" could be many things: "Barbados Cricket Association", "Billiard Congress of America", "British Caving Association", "Business Council of Australia", among others. But what would any of them have to do with "Chefs"?
zahlman · 7h ago
I had to think about it the first time, too.
zahlman · 7h ago
Does the filesystem actually need to be part of the kernel project to work? I can see where you'd need that for the root filesystem, but even then, couldn't one migrate an existing installation to a new partition with a different filesystem?
topspin · 1h ago
No, it does not. It might need to be part of the kernel to be included downstream in Linux distributions, if the file system developer fails to maintain distro buildable modules that don't eat people's data. Should developers provide workable modules, getting these modules into most distros is generally unhindered.
And no, it's not necessary even for root file systems. Linux can load modules, such as file system drivers, before it mounts root. That's what initramfs is about.
ZFSoL has thrived for 15 years, fostering several commercial empires, and has never been in Linus's mainline.
Bcachefs development may continue as Kent Overstreet wishes, and he need not squabble with Linus going forward. Seems like an entirely workable outcome. Kudos to Linus for a.) giving Kent a chance, despite known issues with Kent, and b.) making the difficult decision to reverse himself. Both of these decisions were correct.
What I learn from all of this is that Linus is still in the saddle and still making good calls. We are blessed.
gizmo686 · 6h ago
Even the root filesystem can be FUSE if you want it to. The only thing that needs to be in the kernel is the initial root filesystem driver. Nowadays that is pretty much always just compressed CPIO (initrd). At that point user space can do pretty much whatever before doing a pivot_root operation to wherever.
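(A minimal /init inside the initramfs looks roughly like the sketch below - the module and device names are placeholders, and in an initramfs the pivot_root step is usually done via switch_root:)

    #!/bin/sh
    # /init in the compressed CPIO archive (initramfs)
    mount -t proc none /proc
    mount -t devtmpfs none /dev
    modprobe bcachefs                      # or whatever driver the real root needs, shipped in the initramfs
    mkdir -p /newroot
    mount -t bcachefs /dev/sda2 /newroot   # mount the real root filesystem
    exec switch_root /newroot /sbin/init   # hand off to the real system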
teekert · 6h ago
We use ZFS for that. What we want is something in-kernel, ready to go, 100% supported on root on any Linux system, with no license ambiguity. We want to replace ext4. Maybe btrfs can do it. I hear it has outgrown its rocky puberty.
alphazard · 6h ago
Bcachefs and Btrfs are not really competing with Ext4. There are basically 2 filesystem niches.
The first niche is the full-featured CoW filesystem: snapshots, corruption detection and repair, transparent compression, all that good stuff.
The other niche is being an allocator of sectors. There's one storage device, divide it up amongst all these processes asking for storage. That's Ext4: an allocator of disk sectors, dressed up in a filesystem API. When you are running databases or VMs, all you want is an allocator of sectors. You don't want lots of stuff getting in the way of your writes. You don't want checksumming, you don't want your writes going to a new place every time. You just want write access to part of the disk.
msgodel · 11h ago
The older I get the more I feel like anything other than the ExtantFS family is just silly.
The filesystem should do files, if you want something more complex do it in userspace. We even have FUSE if you want to use the Filesystem API with your crazy network database thing.
heavyset_go · 10h ago
Transparent compression, checksumming, copy-on-write, snapshots and virtual subvolumes should be considered the minimum default feature set for new OS installations in TYOOL 2025.
You get that with APFS by default on macOS these days and those features come for free in btrfs, some in XFS, etc on Linux.
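(For the curious, on btrfs that feature set is a handful of commands - the device and paths below are just examples:)

    mount -o compress=zstd /dev/sdb1 /data            # transparent compression
    btrfs subvolume create /data/projects             # a subvolume
    btrfs subvolume snapshot /data/projects /data/.snapshots/projects-before-upgrade
    cp --reflink=always big.img big-copy.img          # CoW copy, no extra space until modified
    btrfs scrub start /data                           # verify checksums across the filesystem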
riobard · 9h ago
APFS checksums only fs metadata, not user data, which is a pita. Presumably because APFS is used on single-drive systems and there's no redundancy to recover from anyway. Still, not ideal.
vbezhenar · 9h ago
Apple trusts their hardware to do its own checksums properly. Modern SSDs use checksums and parity codes for blocks. SATA/NVMe include checksums for protocol frames. The only unreliable component is RAM, but FS checksums can't help there, because a RAM bit will likely be flipped before the checksum is calculated or after it is verified.
riobard · 8h ago
If they did trust their hardware, APFS wouldn't need to checksum fs metadata either, so I guess they don't trust it well enough? Also, I store files on external drives that are not Apple-sanctioned, and I don't trust those enough either, and there's no option for user-data checksumming at all.
1over137 · 1h ago
Apple does not care about your external non-Apple drives. In the slightest.
londons_explore · 7h ago
Most SSDs can't be trusted to maintain proper data ordering in the case of a sudden power-off.
That makes checksums and journals of only marginal usefulness.
I wish some review website would have a robot plug and unplug the power cable in a test rig for a few weeks and rate which SSD manufacturers are robust to this stuff.
yjftsjthsd-h · 11h ago
I mean, I'd really like some sort of data error detection (and ideally correction). If a disk bitflips one of my files, ext* won't do anything about it.
eptcyka · 6h ago
Bitflips in my files? Well, there’s a high likelihood that the corruption won’t be too bad. Bit flips in the filesystem metadata? There’s a significant chance all of the data is lost.
msgodel · 5h ago
Anything important should really be stored in some sort of distributed system that uses e.g. Merkle trees. If the file system also did that you'd be doing it twice, which would be annoying.
Anything unimportant is really just being cached and it's probably fine if it gets corrupted.
timewizard · 9h ago
> some sort of data error detection (and ideally correction).
That's pretty much built into most mass storage devices already.
> If a disk bitflips one of my files
The likelihood and consequence of this occurring is in many situations not worth the overhead of adding additional ECC on top of what the drive does.
> ext* won't do anything about it.
What should it do? Blindly hand you the data without any indication that there's a problem with the underlying block? Without an fsck what mechanism do you suppose would manage these errors as they're discovered?
yjftsjthsd-h · 8h ago
To your first couple points: I trust hardware less than you.
> What should it do? Blindly hand you the data without any indication that there's a problem with the underlying block?
Well, that's what it does now, and I think that's a problem.
> Without an fsck what mechanism do you suppose would manage these errors as they're discovered?
Linux can fail a read, and IMHO should do so if it cannot return correct data. (I support the ability to override this and tell it to give you the corrupted data, but certainly not by default.) On ZFS, if a read fails its checksum, the OS will first try to get a valid copy (ex. from a mirror or if you've set copies=2), and then if the error can't be recovered then the file read fails and the system reports/records the failure, at which point the user should probably go do a full scrub (which for our purposes should probably count as fsck) and restore the affected file(s) from backup. (Or possibly go buy a new hard drive, depending on the extent of the problem.) I would consider that ideal.
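(Concretely, the workflow on ZFS looks something like this - the pool and dataset names are just examples:)

    zfs set copies=2 tank/home     # keep two copies of every block in this dataset
    zpool status -v tank           # lists checksum errors and any permanently damaged files
    zpool scrub tank               # read and verify every block, repairing from good copies
    zpool status tank              # check progress and results of the scrub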
throw0101d · 9h ago
>> > some sort of data error detection (and ideally correction).
> That's pretty much built into most mass storage devices already.
And ZFS has shown that it is not sufficient (at least for some use-cases, perhaps less of a big deal for 'residential' users).
> The likelihood and consequence of this occurring is in many situations not worth the overhead of adding additional ECC on top of what the drive does.
Not worth it to whom? Not having the option available at all is the problem. I can do a zfs set checksum=off pool_name/dataset_name if I really want that extra couple percentage points of performance.
> Without an fsck what mechanism do you suppose would manage these errors as they're discovered?
Depends on the data involved: if it's part of the file system tree metadata, there are often multiple copies even for a single disk on ZFS. So instead of the kernel consuming corrupted data and potentially panicking (or going off into the weeds), it can find a correct copy elsewhere.
If you're in a fancier configuration with some level of RAID, then there could be other copies of the data, or it could be rebuilt through ECC.
With ext*, LVM, and mdadm no such possibility exists because there are no checksums at any of those layers (perhaps if you glom on dm-integrity?).
And with ZFS one can set copies=2 on a per-dataset basis (perhaps just for /home?), and get multiple copies strewn across the disk: won't save you from a drive dying, but could save you from corruption.
yjftsjthsd-h · 8h ago
> (perhaps if you glom on dm-integrity?).
I looked at that, in hopes of being able to protect my data.
Unfortunately, I considered this something of a fatal flaw:
> It uses journaling for guaranteeing write atomicity by default, which effectively halves the write speed.
Which implies you can already correct errors through a simple majority mechanism.
> or it could be rebuilt through ECC.
So just by having the appropriate level of RAID you automatically solve the problem. Why is this in the fs layer then?
yjftsjthsd-h · 2h ago
> Which implies you can already correct errors through a simple majority mechanism.
I don't think so? You set copies=2, and the disk says that your file starts with 01010101, except that the second copy says your file starts with 01010100. How do you tell which one is right? For that matter, even with only one copy ex. ZFS can tell that what it has is wrong even if it can't fix it, and flagging the error is still useful.
> So just by having the appropriate level of RAID you automatically solve the problem. Why is this in the fs layer then?
Similarly, you shouldn't need RAID to catch problems, only (potentially) to correct them. I do agree that it doesn't necessarily have to be in the FS layer, but AFAIK Linux doesn't have any other layers that do a good job of it (as mentioned above, dm-integrity exists but halving the write speed is a pretty big problem).
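(Toy illustration of the point: a checksum recorded at write time arbitrates between copies, so no majority vote is needed - sha256sum here stands in for the per-block checksums a filesystem would keep:)

    sha256sum myfile > myfile.sha256   # record the checksum while the data is known good
    sha256sum -c myfile.sha256         # later: prints "myfile: OK" or "myfile: FAILED"
                                       # one copy plus a checksum is enough to *detect* corruption;
                                       # a second copy is only needed to *repair* it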
ars · 8h ago
> The likelihood .. of this occurring
That's one unrecoverable read error per 10^14 bits for a consumer drive - and 10^14 bits is just 12TB. A heavy user (lots of videos or games) would see a bit flip a couple of times a year.
magicalhippo · 7h ago
I do monthly scrubs on my NAS, I have 8 14-20TB drives that are quite full.
According to that 10^14 metric I should see read errors just about every month. Except I have just about zero.
Current disks are ~4 years, runs 24/7, and excluding a bad cable incident I've had a single case of a read error (recoverable, thanks ZFS).
I suspect those URE numbers are made by the manufacturers figuring out they can be sure the disk will do 10^14, but they don't actually try to find the real number because 10^14 is good enough.
ars · 5h ago
If you are using enterprise drives those are 10^16, so that might explain it.
magicalhippo · 4h ago
Fair, newest ones are, but two of my older current drives are IronWolfs 16TB which are 10^15 in the specs[1], and they've been running for 5.4 years. Again without any read errors, monthly scrubs, and of course daily use.
And before that I have been using 8x WD Reds 3TB for 6-7 years, which have 10^14 in the specs[2], and had the same experience with those.
Yes smaller size, but I ran scrubbing on those biweekly, and over so many years?
I'm not really sure how you're supposed to interpret those error rates. The average read error probably has a lot more than 1 flipped bit, right? And if the average error affects 50 bits, then you'd expect 50x fewer errors? But I have no idea what the actual histogram looks like.
timewizard · 3h ago
Is that raw error rate or uncorrected error rate?
anonnon · 11h ago
> The older I get the more I feel like anything other than the ExtantFS family is just silly.
The extended (not extant) family (including ext4) doesn't support copy-on-write. Using it as your primary FS after 2020 (or even 2010) is like using a non-journaling file system after 2010 (or even 2001) - it's a non-negotiable feature at this point. Btrfs has been stable for a decade, and if you don't like or trust it, there's always ZFS, which has been stable for 20 years now. Apple now has APFS, with CoW, on all their devices, while MSFT still treats ReFS as unstable, and Windows servers still rely heavily on NTFS.
NewJazz · 10h ago
CoW is an efficiency gain. Does it do anything to ensure data integrity, like journaling does? I think you are making an unreasonable comparison.
throw0101d · 9h ago
> Does it do anything to ensure data integrity, like journaling does?
What kind of journaling though? By default ext4 only uses journaling for metadata updates, not data updates (see "ordered" mode in ext4(5)).
So if you have a (e.g.) 1000MB file, and you update 200MB in the middle of it, you can have a situation where the first 100MB is written out and the system dies with the other 100MB vanishing.
With a CoW, if the second 100MB is not written out and the file sync'd, then on system recovery you're back to the original file being completely intact. With ext4 in the default configuration you have a file that has both new-100MB and stale-100MB in the middle of it.
The updating of the file data and the metadata are two separate steps (by default) in ext4, whereas with a proper CoW (like ZFS), updates are ACID.
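(The journaling mode is selectable at mount time; data=journal closes that window at a write-amplification cost - the mount lines below are just examples:)

    # default: only metadata goes through the journal, file contents are written in place
    mount -t ext4 -o data=ordered /dev/sdb1 /mnt
    # full data journaling: file contents go through the journal too, at a performance cost
    mount -t ext4 -o data=journal /dev/sdb1 /mnt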
webstrand · 9h ago
I use CoW a lot just managing files. It's only an efficiency gain if you have enough space to do the data-copying operation. And that's not necessarily true in all cases.
Being able to quickly take a "backup" copy of some multi-gb directory tree before performing some potentially destructive operation on it is such a nice safety net to have.
It's also a handy way to backup file metadata, like mtime, without having to design a file format for mapping saved mtimes back to their host files.
anonnon · 9h ago
> CoW is an efficiency gain.
You're thinking of the optimization technique of CoW, as in what Linux does when spawning a new thread or forking a process. I'm talking about it in the context of only ever modifying copies of file system data and metadata blocks, for the purpose of ensuring file system integrity, even in the context of sudden power loss (EDIT: wrong link): https://www.qnx.com/developers/docs/8.0/com.qnx.doc.neutrino...
If anything, ordinary file IO is likely to be slightly slower on a CoW file system, due to it always having to copy a block before said block can be modified and updating block pointers.
tbrownaw · 8h ago
> The extended (not extant) family (including ext4)
I read that more as "we have filesystems at home, and also get off my lawn".
milkey_mouse · 11h ago
Hell, there's XFS if you love stability but want CoW.
josephcsible · 11h ago
XFS doesn't support whole-volume snapshots, which is the main reason I want CoW filesystems. And it also stands out as being basically the only filesystem that you can't arbitrarily shrink without needing to wipe and reformat.
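(For comparison, ext4 supports offline shrinking while XFS only grows - device and mount point names below are examples:)

    # ext4: offline shrink is supported
    umount /mnt/data
    e2fsck -f /dev/vg0/data
    resize2fs /dev/vg0/data 50G
    # XFS: growing works, but there is no shrink operation
    xfs_growfs /mnt/data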
adrian_b · 1h ago
Many years ago, XFS did not support snapshots.
However, XFS has also supported snapshots for a long time now.
I am not sure what you mean by "whole-volume" snapshots, but I have not noticed any restrictions in the use of the XFS snapshots. As expected, they store a snapshot of the entire file system, which can be restored later.
In many decades of managing computers with all kinds of operating systems and file systems, on a variety of servers and personal computers, I have never had the need to shrink a file system. I cannot imagine how such a need can arise, except perhaps as a consequence of bad planning.
There are also many decades since I have deprecated the use of multiple partitions on a storage device, with the exception of bootable devices, which must have a dedicated partition for booting, conforming to the BIOS or UEFI expectations. For anything that was done in the ancient times with multiple partitions there are better alternatives now. With the exception of bootable USB sticks with live Linux or FreeBSD partitions, I use XFS on whole SSDs or HDDs (i.e. unpartitioned), regardless of whether they are internal or external, so there is never any need for changing the size of the file system.
Even so, copying a file system to an external device, reformatting the device and copying the file system back is not likely to be significantly slower than shrinking in place. In fact sometimes it can be faster and it has the additional benefit that the new copy of the file system will be defragmented.
Much more significant than the lack of shrinking ability, which may slow down a little something that occurs very seldom, is that both EXT4 and XFS are much faster for most applications than the other file systems available for Linux, so they are fast for the frequent operations. You may choose another file system for other reasons, but choosing it for making faster a very rare operation like shrinking is a very weak reason.
MertsA · 7h ago
You can shrink XFS, but only the realtime volume. All you need is xfs_db and a steady hand. I once had to pull this off for a shortened test program for a new server platform at Meta. Works great except some of those filesystems did somehow get this weird corruption around used space tracking that xfs_repair couldn't detect... It was mostly fine.
kzrdude · 9h ago
there was the "old dog new tricks" xfs talk long time ago, but I suppose it was for fun and exploration and not really a sneak peek into snapshots
leogao · 10h ago
you can always have an LVM layer for atomic snapshots
josephcsible · 10h ago
There are advantages to having the filesystem do the snapshots itself. For example, if you have a really big file that you keep deleting and restoring from a snapshot, you'll only pay the cost of the space once with Btrfs, but will pay it every time over with LVM.
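(The two approaches side by side - volume and path names are just examples:)

    # LVM: block-level snapshot, needs pre-allocated space that fills as blocks diverge
    lvcreate --snapshot --size 20G --name data-snap /dev/vg0/data
    # btrfs: filesystem-level snapshot, shares extents until files actually change
    btrfs subvolume snapshot /data /data/.snapshots/data-snap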
robotnikman · 11h ago
>Windows will at some point have ReFS
They seem to be slowly introducing it to the masses, Dev drives you set up on Windows automatically use ReFS
leogao · 10h ago
btrfs has eaten my data within the last decade. (not even because of the broken erasure coding, which I was careful to avoid!) not sure I'm willing to give it another chance. I'd much rather use zfs.
bombcar · 7h ago
I used reiserfs for a while after I noticed it eating data (tail packing plus power loss) but quickly switched to xfs when it became available.
Speed is sometimes more important than absolute reliability, but it’s still an undesirable tradeoff.
msgodel · 11h ago
Again, I don't really want the kernel managing a database for me like that; the few applications that need that can do it themselves just fine. (IME mostly just RDBMSs and Qemu.)
zahlman · 7h ago
... NTFS does copy-on-write?
... It does hard links? After checking: It does hard links.
... Why didn't any programs I had noticeably take advantage of that?
anonnon · 6h ago
> NTFS does copy-on-write?
No, it doesn't. Maybe you're thinking of shadow volume copies or something else. CoW files systems never modify data or metadata blocks directly, only modifying copies, with the root of the updated block pointer graph only updated after all other changes have been synced. Read this: https://www.qnx.com/developers/docs/8.0/com.qnx.doc.neutrino...
zahlman · 4h ago
>No, it doesn't. Maybe you're thinking of shadow volume copies or something else.
I was asking, because didn't know, and I thought the other person was implying that it did.
I know what copy-on-write is.
anonnon · 3h ago
The "other person" (only mention of NTFS) is me, here:
> while MSFT still treats ReFS as unstable, and Windows servers still rely heavily on NTFS.
By this I implied it's an embarrassment to MSFT that iOS devices have a better, more reliable file system (APFS) than even Windows servers now (having to rely on NTFS until ReFS is ready for prime time). If HN users and mods didn't tone-police so heavily, I could state things more frankly.
baggy_trough · 12h ago
No matter how good the code is, Overstreet's behavior and the apparent bus factor of 1 leave me reluctant to investigate this technology.
dsp_person · 12h ago
Curious about this process. Can anyone submit patches to bcachefs and Kent is just the only one doing it? Is there a community with multiple contributors hacking on the features, or just Kent? If not, what could he do to grow this? And how does a single person receiving patreon donations affect the ability of a project like this to get passed bus factor of 1?
koverstreet · 10h ago
I take patches from quite a few people. If the patch looks good, I'll generally apply it.
And I encourage anyone who wants to contribute to join the IRC channel. It's not a one man show, I work with a lot of people there.
nolist_policy · 11h ago
Generally you need a maintainer for your subsystem who sends pull requests to Linus.
ddtaylor · 5h ago
I fear posting this because it's YouTube content and I don't know the creator very well beyond these videos, but I have been following this saga a bit from this creator:
He gets a few words wrong because, as I understand it, he covers the topic in a broader way, but most of his coverage seems objective and factual. He does have some opinions, but I think it's closer to journalism of the LKML than an opinion piece.
layer8 · 11h ago
For some reason I always read this as “BCA chefs”.
Today Kent posted another rc patch with a new filesystem option. But it was merged...
shmerl · 12h ago
Maybe bcachefs should have been governed by a group of people, not a single person.
mananaysiempre · 11h ago
Committees are good-to-acceptable for keeping things going, but bad for initial design or anything requiring a coherent vision and taste. There are some examples of groups that straddled the boundary between a committee and a creative collaboration and produced good designs (Algol 60; RnRS for n ≤ 5; IIRC the design of ZFS was produced by a three-person team), but they are more of an exception, and the secret of tying together such groups remotely doesn’t seem to have been cracked. Even in the keeping things going department, a committee’s inbuilt and implicit self-preservation mechanisms can lead it to keep fiddling with things far longer than would be advisable.
koverstreet · 10h ago
Actually, I think remote collaboration can work with the right medium and tools. For bcachefs, that's been IRC; we have an extremely active channel where we do a lot of collaborative debugging, design discussion, helping new users, etc.
I know a lot of people heavily use slack/discord these days, but personally I find the web interfaces way too busy. IRC all the way, for me.
But the problem of communicating effectively enough to produce a coherent design is very real - this goes back to Fred Brooks (Mythical Man Month). I think bcachefs turned out very well with the way the process has gone to date, and now that it's gotten bigger, with more distinct subsystems, I am very eagerly looking forward to the date when I can hand off ownership of some of those subsystems. Lately we've had some sharp developers getting involved - for the past several years it's been mainly users testing it (and some of them have gotten very good at debugging at this point).
So it's happening.
shmerl · 10h ago
In this case it's more about keeping things in check and not letting one person with a tendency to ignore kernel development rules derail the whole project.
I'm not saying those concerns are wrong, but when it causes fallout like being kicked out of the kernel, the downsides clearly are more severe than any potential benefits.
guerrilla · 12h ago
The drama with Linux filesystems is just nuts... It never ends.
msgodel · 11h ago
It's crazy people spend so much time paying attention to Hollywood celebrity drama.
Opens LKML archive hoping for another Linus rant.
rendaw · 10h ago
I'm sure there's just as much political all-star programmer infighting at Google/Apple/Microsoft/whatever too; it's just that this is done in public.
quotemstr · 11h ago
At least Kent hasn't murdered his wife
Tostino · 9h ago
First thing that came to mind when I saw this drama.
mschuster91 · 12h ago
The stakes are the highest across the entire kernel. Data that's corrupt cannot (easily) be uncorrupted.
tpolzer · 12h ago
Bad drivers could brick (parts of) your hardware permanently.
While you should have a backup of your data anyway.
devwastaken · 12h ago
Good. There is no place for unstable developers in a stable kernel.
anonfordays · 8h ago
Linux needs a true answer to ZFS that's not btrfs. Sadly the ship has sailed for btrfs; after 15+ years it's still not something you can trust.
Apparently bcachefs won't be the successor. Filesystem development for Linux needs a big shakeup.
em-bee · 7h ago
several people i know are using btrfs without problems for years now. i use it on half a dozen devices. what's your evidence that it is not trustable?
yjftsjthsd-h · 2h ago
In this case, some people using it and not having problems is much less interesting than some people that are having problems. As a former user who lost 2 root filesystems to BTRFS, I'm not touching it for a long time.
rcxdude · 5h ago
Many reports of data loss or even complete filesystem loss, often in very straightforward scenarios.
The amount of "mostly OK" and still an "unstable" RAID6 implementation. Not going to trust a file system with "mostly OK" device replace. Anecdotally, you can search the LKML and here for tons of data loss stories.
bombcar · 7h ago
ZFS is good enough for 90% of people who need that so no real money is available for anything new.
Maybe a university could do it.
anonfordays · 7h ago
Indeed, and its inclusion in Ubuntu is fantastic. It's also showing its age, 20 years now. Ts'o, where are you when we need you most!?
bombcar · 7h ago
Or someday a file system will somehow piss off Linus and he’ll write one in a weekend or something ;)
XorNot · 6h ago
I mean, is it? It's a filesystem and it works. How is it "showing its age"?
charcircuit · 12h ago
If Linux added a stable kernel module API, this wouldn't be a huge problem, and it would be easy for bcachefs to ship as a kernel module with its own independent release schedule.
homebrewer · 12h ago
It would also have a lot fewer FOSS drivers; neither we nor FreeBSD (which is often invoked in these complaints) would have amdgpu, for example.
Nextgrid · 9h ago
What's so bad about it? Windows to this day doesn't have FOSS drivers as standard and despite that is pretty successful. In practice, as long as a driver works it's fine for the vast majority of users, and you can always disassemble and binary-patch if really needed.
(it's not obvious that having to occasionally disassemble/patch closed-source drivers is worse than the collective effort wasted trying to get every single thing in the kernel and keep it up to date).
charcircuit · 12h ago
I would actually posit that making it easier to write drivers would have the opposite effect and result in more FOSS drivers.
>FreeBSD (which is often invoked in these complaints) would have amdgpu for example.
In such a hypothetical FreeBSD could reimplement the stable API of Linux.
smcameron · 11h ago
No, every GPU vendor out there would prefer proprietary drivers, and with a stable ABI they could do it, and would do it; there is no question about it.
I worked for HP on storage drivers for a decade or so, and had there been a stable ABI, HP would have shipped proprietary storage drivers for everything. Even without a stable ABI, they shipped proprietary drivers at considerable effort, compiling for myriad different distro kernels. It was a nightmare, and a good thing too, or there wouldn't be any open source drivers.
charcircuit · 9h ago
I never said they wouldn't. Having more and better drivers is a good thing for Linux users. It's okay for proprietary drivers to exist. The kernel isn't meant to be a vehicle to push the free software agenda.
throw0101d · 12h ago
> In such a hypothetical FreeBSD could reimplement the stable API of Linux.
Like it does with the userland API of Linux, which is stable:
It's plenty easy to make drivers now, it's just hard to distribute them without sharing the source.
There is absolutely no good reason not to share driver source though so that's a terrible use case to optimize for.
heavyset_go · 9h ago
The unstable interface is Linux's moat, and IMO, is the reason we're able to enjoy such a large ecosystem of hardware via open source operating systems.
zahlman · 7h ago
I'm afraid I don't follow your reasoning.
heavyset_go · 3h ago
With a stable driver API/ABI, vendors will just dump closed-source drivers once and call it a day, or pull an Apple/Sony/Nintendo with FreeBSD, where you effectively get a closed-source OS that supports your hardware.
An unstable interface means the driver source needs to be updated frequently, you can't just dump a .ko file online and expect it to work for however long the hardware lasts.
Easiest way to approach it is to attempt to upstream drivers, and potentially take advantage of free labor and maintenance in virtual perpetuity, which is good for all Linux users. If vendors don't want to spend the effort upstreaming drivers, but they need to support Linux, by necessity the drivers must be open source so they can be compiled against users' changing kernels. That's at least a step in the right direction, and should anyone want to make the effort, they're free to upstream drivers themselves.
rcxdude · 5h ago
The interface churn in Linux adds a strong incentive (on top of the GPL) to upstream drivers, i.e. publish them as open source. Not doing so tends to mean you get stuck on old versions. If it had a stable interface, hardware vendors would just release crappy binary blobs and they'd only be usable on Linux, and not maintainable by anyone else (and hardware vendors don't generally maintain their drivers for long).
josephcsible · 12h ago
The slight benefit for out-of-tree module authors wouldn't be worth the negative effects on the rest of the kernel to everyone else.
charcircuit · 12h ago
"slight benefit"? Having a working system after upgrading your kernel is not just a slight benefit. It's table stakes. Especially for something critical like a filesystem it should never break.
>negative effects on the rest of the kernel
Needing to design and support an API is not purely negative for kernel developers. It also gives a chance to have a proper interface for drivers to use and follow. Take a look at the Rust for Linux which keeps running into undocumented APIs that make little sense and are just whatever <insert most popular driver> does.
josephcsible · 12h ago
> Having a working system after upgrading your kernel is not just a slight benefit. It's table stakes.
We already have that, with the "don't break userspace" policy combined with all of the modules being in-tree.
> Needing to design and support an API is not purely negative for kernel developers.
Sure, it's not purely negative, but it's overall a big net negative.
> Take a look at the Rust for Linux which keeps running into undocumented APIs that make little sense and are just whatever <insert most popular driver> does.
That's an argument against a stable module API! Those things are getting fixed as they get found, but if we had a stable module API, we'd be stuck with them forever.
I recommend reading https://docs.kernel.org/process/stable-api-nonsense.html
>We already have that, with the "don't break userspace"
Bcachefs is not user space.
>with all of the modules being in-tree.
That is not true. There are out of tree modules such as ZFS.
>That's an argument against a stable module API!
My point was that there was 0 thought put into creating a good API. Additionally API could be evolved over time and have a support period if you care about being able to evolve it and deprecate the old one. And likely even with a better interface there is probably a way to make the old API still function.
josephcsible · 12h ago
> Bcachefs is not user space.
bcachefs is still in-tree.
> That is not true. There are out of tree modules such as ZFS.
ZFS could be in-tree in no time at all if Oracle would fix its license. And until they do that, it's not safe to use ZFS-on-Linux anyway, since Oracle could sue you for it.
> My point was that there was 0 thought put into creating a good API.
There is thought put into it: it's exactly what we need right now, because if what we need ever changes, we'll change the API too, thus avoiding YAGNI and similar problems.
> Additionally API could be evolved over time and have a support period if you care about being able to evolve it.
If a temporary "support period" is what you want, then just use the LTS kernels. That's already exactly what they give you.
> And likely even with a better interface there is probably a way to make the old API still function.
That's the big net negative I was mentioning and that https://docs.kernel.org/process/stable-api-nonsense.html talks about too. Sometimes there isn't a feasible way to support part of an old API anymore, and it's not worth holding the whole kernel back just for the out-of-tree modules.
yjftsjthsd-h · 10h ago
> ZFS could be in-tree in no time at all if Oracle would fix its license. And until they do that, it's not safe to use ZFS-on-Linux anyway, since Oracle could sue you for it.
IANAL, but I don't believe either of these things are true.
OpenZFS contains enough code not authored by Sun/Oracle that relicensing it now is effectively impossible.
OTOH, it is under the CDDL, which is a perfectly good open source license; AFAICT the problem, if one exists at all[0], only manifests when distributing the combination of CDDL (OpenZFS) and GPL (Linux) software. If you download CDDL software and compile it into GPL software yourself (say, with DKMS) then it should be fine because you aren't distributing it.
[0] This is a case where I'm going to really emphasize that I'm really not a lawyer and merely point out that ex. Canonical's lawyers do seem to think CDDL+GPL is okay.
timschmidt · 10h ago
> it should be fine because you aren't distributing it.
Which excludes a vast amount of activity one might want to use Linux for which is otherwise allowed. Like selling a device with a Linux installation, distributing VM or system restore images, etc.
yjftsjthsd-h · 8h ago
Sure, I happily grant that the licensing situation is really annoying and restricts the set of safe actions. I only object to claims that all use of ZFS is legally risky.
skissane · 2h ago
> OpenZFS contains enough code not authored by Sun/Oracle that relicensing it now is effectively impossible.
I don't think so. Suppose Oracle did agree to put their code under GPLv2/CDDL dual licensing.
Then, I'm sure if you look at the non-Oracle contributors to OpenZFS, there's a few big ones and a long tail of smaller ones. Many of the big ones might be able and willing to follow Oracle's lead. Chasing down the smaller ones may be harder, but it is possible their contributions may be judged as sufficiently trivial to escape copyright protection. More substantive contributions from people who are unreachable (or unwilling/unable to consent to the relicensing) can pose a bigger issue, but it could be solved either by (a) intentionally rewriting their contributions from scratch, or (b) waiting: given enough time, there's a decent chance (a) will happen anyway just through normal code churn, even if you don't do it intentionally for licensing reasons.
It would be a big, multi-year project, but one that other open source communities have successfully tackled, most notably LLVM – so I do think "effectively impossible" is too strong.
I think the biggest blocker is that, it is hard to motivate people to make the effort unless Oracle is on-board – and they've displayed no signs of willingness to change their position on this. I doubt Oracle will budge, but anything is possible.
Another possibility to consider – CDDL clause 4 allows the "license steward" (Sun Microsystems) [0] to release a new version, which automatically applies to all CDDL software unless the developers explicitly opt-out. I don't know if any of the OpenZFS developers have made such an explicit opt-out – but if they haven't, then Oracle could issue a new CDDL version adding a clause saying that if the covered work is ZFS or a derivative thereof, anyone is allowed to relicense it under GPLv2. Then you wouldn't even need to track down and get the consent of non-Oracle contributors. For a real historical example of something like this, witness how the FSF issued a new GFDL version just to help Wikipedia move from GFDL to Creative Commons licensing. But, again, even if this is legally possible, unlikely (but not impossible) Oracle will ever cooperate in it.
Another blocker is that even if OpenZFS were relicensed as GPLv2/CDDL, that still wouldn't solve the issue that Torvalds is unlikely to agree to upstreaming it as part of the mainline Linux kernel – a massive code base written in a very different style, and having portability concerns (wanting to work on BSD/etc too) which Linux normally doesn't care about. Possibly if you forked OpenZFS, ripped out the cross-platform aspects, and rewrote it to be more like typical Linux kernel code, it might have a chance. But, will anyone be willing to make that massive investment of time and effort? And even assuming they succeeded, we'd now have two forks of ZFS (one in the Linux kernel, one for other operating systems), adding to the maintenance burden, and the risk they'd diverge over time would be high.
[0] Sun Microsystems still legally exists on paper, and probably will indefinitely, as an Oracle subsidiary – it has been renamed to Oracle America Inc – so Oracle has effectively inherited Sun's rights as CDDL license steward
mustache_kimono · 4h ago
> And until they do that, it's not safe to use ZFS-on-Linux anyway, since Oracle could sue you for it.
This is clearly untrue. Upon what theory?
charcircuit · 8h ago
>it's not safe to use ZFS-on-Linux anyway, since Oracle could sue you for it.
It's not against the license to use them together.
>If a temporary "support period" is what you want, then just use the LTS kernels. That's already exactly what they give you.
Only the Android one does. The regular LTS one has no such guarantee.
msgodel · 11h ago
Does your system have some critical out of tree driver? That should have been recompiled with the new kernel; if it wasn't, that sounds like a failure of whoever maintains the driver/kernel/distro (which may be you, if you're building it yourself).
We're very far along in that process now, but it's still marked as experimental because it is not quite ready for widespread deployment by just anyone. 6.16 is getting damn close, though.
That means a lot of our users now are people getting it from distro kernels, who often have never compiled a kernel before - nevertheless, they can and do report bugs.
And no matter where you are in the rollout, when you get bug reports you have to fix them and get the fixes out to users in a timely manner so that they can keep running, keep testing and reporting bugs.
It's a big loss if a user has to wait 3 months for a bugfix - they'll get frustrated and leave, and a big part of what I do is building a community that knows how the system works, how to help debug, and how to report those bugs.
A very common refrain I get is "it's experimental <expletive deleted>, why does it matter?" - and, well, the answer is getting fixes out in a timely manner matters just as much if not more if we want to get this thing done in a timely manner.
It seems like you want that kind of fast turn around time in the kernel though, which strikes me as an impossible goal.
At the time it looked like that could happen - there was real interest from Red Hat prior to merging. Sadly Red Hat's involvement never translated into much code, and while upstreaming did get me a large influx of users - many of which have helped enormously with the QA and stabilization effort - the drama and controversies have kept developers away, so on the whole it's meant more work, pressure and stress for me.
So DKMS wouldn't be the worst route, at this point. It would be a real shame though, this close to taking the experimental label off, and an enormous hassle for users and distributions.
But it's not my call to make, of course. I just write code...
I've seen this sentiment a lot lately. That disagreeable top performers have to be disposed of because they are "toxic" or "problematic".
You aren't doing your job as a leader if this is your attitude to good engineers. Engineering is a field where a small amount of the people create a large amount of the value. You can either understand that, and take it upon yourself to integrate disagreeable yet high performing people into the team, paving over the rough patches yourself. Or you can oust them, and quite literally take a >50% productivity hit on your team.
A disagreeable person will take up more of your time as a manager, but a high performer is worth significantly more of your time. When these traits co-occur in the same person, the cost-benefit is complicated. The reason we talk about this problem a lot in tech is because it is legitimately a tough call, with errors in both directions. Wishing that the right move was always as simple as kicking someone off the team doesn't make it true, although it may relieve you from having to contend with the decision.
In the short term, possibly. But do you think bcachefs is better in the current situation than if it moved at half the speed, but without conflict? By being out of the kernel it will get less testing, fewer contributions, the main developer will waste time rebasing the patch set with every release, and distros are unlikely to expose bcachefs to the user any time soon. When you're working with an ecosystem / teams, single person's performance really doesn't mean that much in the end. And occasionally Kent will still have to upstream some changes to interfaces - how likely is anyone to review/merge them quickly?
And now, what are the chances this will ever become more than a single person project really?
> When you're working with an ecosystem / teams, single person's performance really doesn't mean that much in the end.
This is demonstrably not true. Kent brought Bcachefs to fruition and got it upstreamed. WireGuard was also one guy. The cryptography used in both, also one guy. There's an argument to be made that given an elegant, well designed system, we should assume it came from a single or a few minds. But given a system that's been around for a while, you would be right to assume that a lot of people were/are involved in keeping it around.
Ideally, you teach people how to get along better together; I think of my job as manager (and I effectively do manage a large team these days) as one of teaching and fostering good communication.
Software takes up no space and there is no scarcity. Theoretically there could be any number of maintainers, and what gets uptake is the de facto upstream. That's what people refer to when they talk about free software development in terms of meritocracy.
A way to handle this would be with one person (or more) in between Kent and Linus. And maybe a separate tree only for changes and fixes from bcachefs that those people in between would forward to Linus. A staging of sorts.
My job matches what you're describing, but "bug fix" is interpreted broadly. It basically means "managers don't do anything stupid".
If someone got in trouble under the kind of language you're asking for - "the rules are clear and were broken" - I would feel they were being singled out.
That means Linus has to check each of his PRs assuming that it might be pushing the boundaries without warning.
No amount of post hoc justification gets you that trust back, not when this has happened multiple times now.
https://lore.kernel.org/linux-fsdevel/4xkggoquxqprvphz2hwnir...
The PR description would have been fine - if it had been in the right stage of the process.
However, Kent, if you read this: please just settle down and follow the rules. Quit deliberately antagonizing Linus. The constant drama is incredibly offputting. Don't jeopardize the entire future of bcachefs over the silliest and most temporary concerns.
If you absolutely must argue about some rule or other, then make that argument without having your opening move be to blatantly violate them and then complain when people call you out.
You were the one who wanted into the kernel despite many suggestions that it was too early. That comes with tradeoffs. You need to figure out how to live with that, at least for a year or two. Stop making your self-imposed problems everyone else's problems.
Kent is absolutely technically capable of, and has the vision to, finally displace ext4, xfs, and zfs with a new filesystem that Does Not Lose Data. To jeopardize that by refusing to work within the well-established structure is madness.
The objection is that the tiniest bug-fix windows get everything but the kitchen sink.
These are both uncomfortable positions to occupy, without doubt.
And the whole reason for a filesystem's existence is to store and maintain your data, so if that is what the patch is for, yes, it should be under consideration as a hotfix.
There's also the broader context: it's a major problem for stabilization if we can't properly support the people using it so they can keep testing.
More context: the kernel as a whole is based on fixed time tables and code review, which it needs because QA (especially automated testing) is extremely spotty. bcachefs's QA, both automated testing and community testing, is extremely good, and we've had bugfix patchsets either held up or turn into flamewars because of this mismatch entirely too many times.
At work we have our main application which also contains a lot of customer integrations. Our policy has been new features in trunk only, except if it's entirely contained inside a customer-specific integration module.
We do try to avoid it, but this does allow us to be flexible with regards to customer needs, while keeping the base application stable.
This new recovery feature was, as far as I could see, entirely contained within the bcachefs kernel code. Given the experimental status, as long as it was clearly communicated to users, I don't see a huge problem allowing such self-contained features during the RC phase.
Obviously a requirement must be that it doesn't break the build.
Then what you do is you try to split your work in two. You could think of a stopgap measure or a workaround which is small, can be reviewed easily, and will reduce the impact of the bug while not being a "proper" fix, and prepare the "properer" fix when the merge window opens.
I would ask, since the bug has probably existed since the last stable release, how come it fell through the cracks and has only been noticed recently? Could it be that not all setups are affected? If so, can't they live with it until the next merge window?
By making a "feature that fixes the bug for real", you greatly expand the area in which new, unknown bugs may land, with very little time to give it proper testing. This is inevitable, as evidenced by the simple fact that the bug you were trying to fix exists. You can be good, but not that good. Nobody is that good. If anybody was that good, they wouldn't have the bug in the first place.
If you have commercial clients who use your filesystem and you have contractual obligations to fix their bugs and keep their data intact, you could (I'd even say "should") maintain an out-of-tree version with its own release and bugfix schedule. This is IMO the only reasonable way to have it, because the kernel is a huge administrative machine with lots of people, and by mainlining stuff, you necessarily become co-dependent on the release schedule for the whole kernel. I think a conflict between kernel's release schedule and contractual obligations, if you have any, is only a matter of time.
That is indeed what I normally do. For example, 6.14 and 6.15 had people discovering btree iterator locking bugs (manifesting as assertion pops) while running evacuates on large filesystems (it's hard to test a sufficiently deep tree depth in virtual machine tests with our large btree nodes); some small hotfixes went out in rc kernels, but the majority of the work (a whole project to add assertions for path->should_be_locked, which should shut these down for good) waited until the 6.16 merge window.
That was for a less critical bug - your machine crashing is somewhat less severe than losing a filesystem.
In this case, we had a bug pop up in 6.15 where the link count in the VFS inode getting screwed up caused an inode to be deleted that shouldn't have been - a subvolume root - and then an untested repair path took out the entire subvolume.
Ouuuuch.
That's why the repair code was rushed; it had already gotten one filesystem back, and I'd just gotten another report of someone else hitting it - and for every bug report there are almost always more people who hit it and don't report it.
And considering that a lot of people running bcachefs now are getting it from distro kernels and don't know how to build kernels - that is why it was important to get this out quickly through the normal channels.
In addition, the patch wasn't risky, contrary to what Ted was saying. It's a code path that's very well covered by automated tests, including KASAN/UBSAN/lockdep variants - those would have exploded if this patch was incorrect.
When to ship a patch is always a judgement call, and part of how you make that call is how well your QA process can guarantee the patch is correct. Part of what was going on here is a disconnect between those of us who do make heavy use of modern QA infrastructure and those who do it the old school way, relying heavily on manual review and long testing periods for rc kernels.
I think the performance cost of FUSE compared to in-kernel filesystems is improving with time - FUSE with io_uring is a big step forward, but the immaturity of io_uring is an obstacle to its adoption, at least in the short-to-medium-term. I’m sure in the future we’ll see even further improvements in this area. But will we ever reach the Nirvana where FUSE equals the performance of in-kernel filesystems, or (maybe more realistically) the performance overhead has become so marginal nobody is bothered by it in practice? I’d like to think we eventually will, but it is far from certain.
Direct IO would be slower via FUSE, but L4 style IPC could solve that.
It would be an interesting proposition, although not my first choice for the direction I want to go in :)
By contrast, improvements in the performance of FUSE, L4-style IPC, could be much more widely beneficial - both for developers of new physical filesystems (by making userspace implementations possible, where they can iterate faster, get better API/ABI stability, and reach end-users more easily), and for developers of numerous other pieces of software too.
Of course, you personally are going to scratch the itch you want to scratch. But in terms of what’s most beneficial for the Linux ecosystem as a whole, I think FUSE improvements and L4-style IPC would deliver the most benefit per unit of effort
Comparatively, I find subtweeting him from the sanctity of Mastodon, with a few insults and backhanded compliments thrown in for good measure, a bit low.
There is always a point where you have to say "no I can't work with this person any more", but while you are still trying to it's worth trying to figure out why someone is behaving as they do.
Agree, plus I’d add: if we are going to criticise other people’s communication style/abilities or attitude, then using a vague, vulgar and hurtful slang term like “arsehole”/“asshole” (and similar slang such as “dick”, “prick”, etc) is an example of exhibiting the very thing one is complaining about in making the complaint, which is fundamentally hypocritical. One can state the same concerns in a more professional way, focusing on the details of the specific behaviour pattern rather than a vague term which can refer to lots of distinct behaviours (e.g. people with ASD traits who hurt the feelings of others because they honestly have trouble thinking about them, versus people with antisocial or narcissistic personality disorder traits who knowingly hurt the feelings of others because they enjoy doing so) - labelling the behaviour pattern, not the person, and acknowledging that it is quite possibly due to an unintentional skills gap, (sub)culture clash, differences in life experiences, neurodiversity/neurodivergence/mental health/trauma, etc.
I also think it is helpful when criticising the flaws of others to try to relate them to one’s own, whenever possible - e.g. “sometimes in the past I did X, and from my perspective it looks like you are doing something similar”. Hurtful labels are not encouraging that kind of self-reflectiveness at all; they promote the idea that “I’m one of the good ones but you are one of the bad ones”.
bCacheFS, not BCA Chefs. I’m not clued into the kernel at this level so I racked my brain a bit.
By contrast, I don't know what "BCA Chefs" is supposed to be. "BCA" could be many things: "Barbados Cricket Association", "Billiard Congress of America", "British Caving Association", "Business Council of Australia", among others. But what would any of them have to do with "Chefs"?
And no, it's not necessary even for root file systems. Linux can load modules, such as file system drivers, before it mounts root. That's what initramfs is about.
ZFSoL has thrived for 15 years, fostering several commercial empires, and has never been in Linus's mainline.
Bcachefs development may continue as Kent Overstreet wishes, and he need not squabble with Linus going forward. Seems like an entirely workable outcome. Kudos to Linus for a.) giving Kent a chance, despite known issues with Kent, and b.) making the difficult decision to reverse himself. Both of these decisions were correct.
What I learn from all of this is that Linus is still in the saddle and still making good calls. We are blessed.
The first niche is the full-featured CoW filesystem; it has snapshots, detects and repairs corruption, transparent compression, all that good stuff.
The other niche is being an allocator of sectors. There's one storage device, divide it up amongst all these processes asking for storage. That's Ext4: an allocator of disk sectors, dressed up in a filesystem API. When you are running databases or VMs, all you want is an allocator of sectors. You don't want lots of stuff getting in the way of your writes. You don't want checksumming, you don't want your writes going to a new place every time. You just want write access to part of the disk.
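To make "just an allocator of sectors" concrete, here is a minimal Python sketch of the access pattern a database typically wants: O_DIRECT I/O that bypasses the page cache, with the filesystem doing little more than mapping file offsets to disk blocks. The file name and sizes are made up for illustration.

    import os, mmap

    BLOCK = 4096  # O_DIRECT needs block-aligned buffer, offset and length

    # Hypothetical preallocated data file; a database would typically set this
    # up once and then treat it as raw space it manages itself.
    fd = os.open("datafile.bin", os.O_RDWR | os.O_CREAT | os.O_DIRECT, 0o644)
    os.ftruncate(fd, 16 * BLOCK)

    # An anonymous mmap gives page-aligned memory, satisfying O_DIRECT's
    # buffer alignment requirement.
    buf = mmap.mmap(-1, BLOCK)
    buf[:] = b"A" * BLOCK

    # Write one block-sized record straight to disk, bypassing the page cache.
    os.pwrite(fd, buf, 4 * BLOCK)
    os.close(fd)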
The filesystem should do files, if you want something more complex do it in userspace. We even have FUSE if you want to use the Filesystem API with your crazy network database thing.
You get that with APFS by default on macOS these days and those features come for free in btrfs, some in XFS, etc on Linux.
That makes checksums and journals of only marginal usefulness.
I wish some review website would have a robot plug and unplug the power cable in a test rig for a few weeks and rate which SSD manufacturers are robust to this stuff.
Anything unimportant is really just being cached and it's probably fine if it gets corrupted.
That's pretty much built into most mass storage devices already.
> If a disk bitflips one of my files
The likelihood and consequence of this occurring is in many situations not worth the overhead of adding additional ECC on top of what the drive does.
> ext* won't do anything about it.
What should it do? Blindly hand you the data without any indication that there's a problem with the underlying block? Without an fsck what mechanism do you suppose would manage these errors as they're discovered?
> What should it do? Blindly hand you the data without any indication that there's a problem with the underlying block?
Well, that's what it does now, and I think that's a problem.
> Without an fsck what mechanism do you suppose would manage these errors as they're discovered?
Linux can fail a read, and IMHO should do so if it cannot return correct data. (I support the ability to override this and tell it to give you the corrupted data, but certainly not by default.) On ZFS, if a read fails its checksum, the OS will first try to get a valid copy (ex. from a mirror or if you've set copies=2), and then if the error can't be recovered then the file read fails and the system reports/records the failure, at which point the user should probably go do a full scrub (which for our purposes should probably count as fsck) and restore the affected file(s) from backup. (Or possibly go buy a new hard drive, depending on the extent of the problem.) I would consider that ideal.
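To make that read path concrete, here is a tiny Python sketch of the idea (not actual ZFS code; the function name and use of SHA-256 are just for illustration): verify each available copy against the stored checksum, return the first good one, and fail loudly if none verify.

    import hashlib

    def read_block(copies, stored_checksum):
        # copies: candidate blocks (e.g. from a mirror, or ZFS copies=2)
        # stored_checksum: checksum kept in the block pointer / metadata
        for data in copies:
            if hashlib.sha256(data).hexdigest() == stored_checksum:
                return data  # good copy found (ZFS would also rewrite the bad one)
        # No valid copy: report the error instead of silently returning garbage.
        raise IOError("checksum mismatch on all copies - scrub and restore from backup")

    # One copy corrupted, the other intact: the checksum picks the right one.
    good, bad = b"hello world", b"hellp world"
    stored = hashlib.sha256(good).hexdigest()
    assert read_block([bad, good], stored) == good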
> That's pretty much built into most mass storage devices already.
And ZFS has shown that it is not sufficient (at least for some use-cases, perhaps less of a big deal for 'residential' users).
> The likelihood and consequence of this occurring is in many situations not worth the overhead of adding additional ECC on top of what the drive does.
Not worth it to whom? Not having the option available at all is the problem. I can do a zfs set checksum=off pool_name/dataset_name if I really want that extra couple percentage points of performance.
> Without an fsck what mechanism do you suppose would manage these errors as they're discovered?
Depends on the data involved: if it's part of the file system tree metadata there are often multiple copies even for a single disk on ZFS. So instead of the kernel consuming corrupted data and potentially panicking (or going off into the weeds) it can find a correct copy elsewhere.
If you're in a fancier configuration with some level of RAID, then there could be other copies of the data, or it could be rebuilt through ECC.
With ext*, LVM, and mdadm no such possibility exists because there are no checksums at any of those layers (perhaps if you glom on dm-integrity?).
And with ZFS one can set copies=2 on a per-dataset basis (perhaps just for /home?), and get multiple copies strewn across the disk: won't save you from a drive dying, but could save you from corruption.
I looked at that, in hopes of being able to protect my data. Unfortunately, I considered this something of a fatal flaw:
> It uses journaling for guaranteeing write atomicity by default, which effectively halves the write speed.
- https://wiki.archlinux.org/title/Dm-integrity
Which implies you can already correct errors through a simple majority mechanism.
> or it could be rebuilt through ECC.
So just by having the appropriate level of RAID you automatically solve the problem. Why is this in the fs layer then?
I don't think so? You set copies=2, and the disk says that your file starts with 01010101, except that the second copy says your file starts with 01010100. How do you tell which one is right? For that matter, even with only one copy ex. ZFS can tell that what it has is wrong even if it can't fix it, and flagging the error is still useful.
> So just by having the appropriate level of RAID you automatically solve the problem. Why is this in the fs layer then?
Similarly, you shouldn't need RAID to catch problems, only (potentially) to correct them. I do agree that it doesn't necessarily have to be in the FS layer, but AFAIK Linux doesn't have any other layers that do a good job of it (as mentioned above, dm-integrity exists but halving the write speed is a pretty big problem).
That's 10^14 bits for a consumer drive. That's just 12TB. A heavy user (lots of videos or games) would see a bit flip a couple times a year.
According to that 10^14 metric I should see read errors just about every month. Except I have just about zero.
Current disks are ~4 years, runs 24/7, and excluding a bad cable incident I've had a single case of a read error (recoverable, thanks ZFS).
I suspect those URE numbers are made by the manufacturers figuring out they can be sure the disk will do 10^14, but they don't actually try to find the real number because 10^14 is good enough.
And before that I have been using 8x WD Reds 3TB for 6-7 years, which have 10^14 in the specs[2], and had the same experience with those.
Yes smaller size, but I ran scrubbing on those biweekly, and over so many years?
[1]: https://www.seagate.com/files/www-content/datasheets/pdfs/ir...
[2]: https://documents.westerndigital.com/content/dam/doc-library...
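For what it's worth, taking the 10^14 spec at face value (a big assumption - as the numbers above suggest, real drives seem to do much better), the expected error count just scales linearly with bits read. A quick back-of-the-envelope in Python:

    URE_RATE = 1 / 1e14      # spec: one unrecoverable read error per 10^14 bits read
    TB = 1e12                # terabyte in bytes

    def expected_errors(bytes_read):
        return bytes_read * 8 * URE_RATE

    # Reading ~12.5 TB would, per the spec, average about one error.
    print(expected_errors(12.5 * TB))       # ~1.0

    # A biweekly scrub of an 8 x 3 TB array (assuming it's mostly full) reads
    # roughly 24 TB per pass, ~26 passes a year - the spec would predict dozens
    # of errors per year, which clearly doesn't match the experience above.
    print(expected_errors(24 * TB) * 26)    # ~50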
The extended (not extant) family (including ext4) doesn't support copy-on-write. Using them as your primary FS after 2020 (or even 2010) is like using a non-journaling file system after 2010 (or even 2001)--it's a non-negotiable feature at this point. Btrfs has been stable for a decade, and if you don't like or trust it, there's always ZFS, which has been stable 20 years now. Apple now has APFS, with CoW, on all their devices, while MSFT still treats ReFS as unstable, and Windows servers still rely heavily on NTFS.
What kind of journaling though? By default ext4 only uses journaling for metadata updates, not data updates (see "ordered" mode in ext4(5)).
So if you have a (e.g.) 1000MB file, and you update 200MB in the middle of it, you can have a situation where the first 100MB is written out and the system dies with the other 100MB vanishing.
With a CoW, if the second 100MB is not written out and the file sync'd, then on system recovery you're back to the original file being completely intact. With ext4 in the default configuration you have a file that has both new-100MB and stale-100MB in the middle of it.
The updating of the file data and the metadata are two separate steps (by default) in ext4:
* https://www.baeldung.com/linux/ext-journal-modes
* https://michael.kjorling.se/blog/2024/ext4-defaulting-to-dat...
* https://fy.blackhats.net.au/blog/2024-08-13-linux-filesystem...
Whereas with a proper CoW (like ZFS), updates are ACID.
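A toy Python sketch of that difference (purely illustrative - not how either filesystem is actually implemented): an in-place overwrite interrupted midway leaves a mix of new and stale blocks, while copy-on-write builds the new version elsewhere, so the old data stays intact until the final atomic switch.

    # In-place update: a crash halfway leaves half-new, half-stale data.
    def update_in_place(blocks, new_data, start, crash_after):
        for i, chunk in enumerate(new_data):
            if i == crash_after:
                return blocks                   # simulated power loss
            blocks[start + i] = chunk
        return blocks

    # Copy-on-write: modify copies, switch over only once everything is written.
    def update_cow(blocks, new_data, start, crash_after):
        shadow = list(blocks)                   # work on copies of the blocks
        for i, chunk in enumerate(new_data):
            if i == crash_after:
                return blocks                   # crash: original still fully intact
            shadow[start + i] = chunk
        return shadow                           # atomic "pointer swap" to the new tree

    old = ["old0", "old1", "old2", "old3"]
    new = ["new1", "new2"]
    print(update_in_place(list(old), new, 1, crash_after=1))  # torn: old0, new1, old2, old3
    print(update_cow(list(old), new, 1, crash_after=1))       # intact: all old blocks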
Being able to quickly take a "backup" copy of some multi-gb directory tree before performing some potentially destructive operation on it is such a nice safety net to have.
It's also a handy way to backup file metadata, like mtime, without having to design a file format for mapping saved mtimes back to their host files.
You're thinking of the optimization technique of CoW, as in what Linux does when spawning a new thread or forking a process. I'm talking about it in the context of only ever modifying copies of file system data and metadata blocks, for the purpose of ensuring file system integrity, even in the context of sudden power loss (EDIT: wrong link): https://www.qnx.com/developers/docs/8.0/com.qnx.doc.neutrino...
If anything, ordinary file IO is likely to be slightly slower on a CoW file system, due to it always having to copy a block before said block can be modified and updating block pointers.
I read that more as "we have filesystems at home, and also get off my lawn".
However, XFS has also supported snapshots for a long time.
See for example:
https://thelinuxcode.com/xfs-snapshot/
I am not sure what you mean by "whole-volume" snapshots, but I have not noticed any restrictions in the use of the XFS snapshots. As expected, they store a snapshot of the entire file system, which can be restored later.
In many decades of managing computers with all kinds of operating systems and file systems, on a variety of servers and personal computers, I have never had the need to shrink a file system. I cannot imagine how such a need can arise, except perhaps as a consequence of bad planning. There are also many decades since I have deprecated the use of multiple partitions on a storage device, with the exception of bootable devices, which must have a dedicated partition for booting, conforming to the BIOS or UEFI expectations. For anything that was done in the ancient times with multiple partitions there are better alternatives now. With the exception of bootable USB sticks with live Linux or FreeBSD partitions, I use XFS on whole SSDs or HDDs (i.e. unpartitioned), regardless if they are internal or external, so there is never any need for changing the size of the file system.
Even so, copying a file system to an external device, reformatting the device and copying the file system back is not likely to be significantly slower than shrinking in place. In fact sometimes it can be faster and it has the additional benefit that the new copy of the file system will be defragmented.
Much more significant than the lack of shrinking ability, which may slow down a little something that occurs very seldom, is that both EXT4 and XFS are much faster for most applications than the other file systems available for Linux, so they are fast for the frequent operations. You may choose another file system for other reasons, but choosing it for making faster a very rare operation like shrinking is a very weak reason.
They seem to be slowly introducing it to the masses; Dev Drives you set up on Windows automatically use ReFS.
Speed is sometimes more important than absolute reliability, but it’s still an undesirable tradeoff.
... It does hard links? After checking: It does hard links.
... Why didn't any programs I had noticeably take advantage of that?
No, it doesn't. Maybe you're thinking of shadow volume copies or something else. CoW files systems never modify data or metadata blocks directly, only modifying copies, with the root of the updated block pointer graph only updated after all other changes have been synced. Read this: https://www.qnx.com/developers/docs/8.0/com.qnx.doc.neutrino...
I was asking, because didn't know, and I thought the other person was implying that it did.
I know what copy-on-write is.
> while MSFT still treats ReFS as unstable, and Windows servers still rely heavily on NTFS.
By this I implied it's an embarrassment to MSFT that iOS devices have a better, more reliable file system (APFS) than even Windows servers now (having to rely on NTFS until ReFS is ready for prime time). If HN users and mods didn't tone-police so heavily, I could state things more frankly.
And I encourage anyone who wants to contribute to join the IRC channel. It's not a one man show, I work with a lot of people there.
https://www.youtube.com/@SavvyNik/videos
He gets a few words wrong because my understanding is he covers the topic in a more broad way, but most of his coverage seems objective and factual. He does have some opinions, but I think it's closer to journalism of the LKML than an opinion piece.
I know a lot of people heavily use slack/discord these days, but personally I find the web interfaces way too busy. IRC all the way, for me.
But the problem of communicating effectively enough to produce a coherent design is very real - this goes back to Fred Brooks (Mythical Man Month). I think bcachefs turned out very well with the way the process has gone to date, and now that it's gotten bigger, with more distinct subsystems, I am very eagerly looking forward to the date when I can hand off ownership of some of those subsystems. Lately we've had some sharp developers getting involved - for the past several years it's been mainly users testing it (and some of them have gotten very good at debugging at this point).
So it's happening.
I'm not saying those concerns are wrong, but when it's causing a fallout like being kicked out from the kernel, the downsides clearly are more severe than any potential benefits.
Opens LKML archive hoping for another Linus rant.
While you should have a backup of your data anyway.
Apparently bcachefs won't be the successor. Filesystem development for Linux needs a big shakeup.
The amount of "mostly OK" and still an "unstable" RAID6 implementation. Not going to trust a file system with "mostly OK" device replace. Anecdotally, you can search the LKML and here for tons of data loss stories.
Maybe a university could do it.