TernFS – An exabyte scale, multi-region distributed filesystem

80 points by rostayob on 9/18/2025, 2:36:44 PM · xtxmarkets.com

Comments (16)

eigenvalue · 6m ago
With its focus on immutability and redundancy, this sounds like it would be a good underpinning for a decentralized blockchain file storage system.
mrbluecoat · 1h ago
Cool project and kudos for open sourcing it. Noteworthy limitation:

> TernFS should not be used for tiny files — our median file size is 2MB.

jandrewrogers · 37m ago
I have worked on exabyte-scale storage engines; there is a good engineering reason for this type of limitation.

If you had a 1 KiB average file size, you would have quadrillions of metadata objects to search and manage quickly and at fine granularity. The kinds of operations and coordination you need to do with metadata are difficult to achieve reliably when the metadata structure itself is many PB in size. Many interesting edge cases show up when you have to do deep paging of this metadata off of storage. Making this not slow requires unorthodox design choices that introduce a lot of complexity. Almost none of the metadata fits in memory, including many parts that conventional architectures assume will easily fit in memory.

A mere trillion objects is right around the limit where the allocators, metadata, etc. can be made to scale with heroic effort before conventional architectures break down and things start to become deeply weird on the software design side. Storage engines need to be reliable, so steering clear of that design frontier makes a lot of sense if you can.

It is possible to break this barrier, but doing so introduces myriad interesting design and computer science problems for which there is little literature.
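
To make the quadrillion-object claim concrete, here is a back-of-envelope calculation in Go; the 128 bytes of metadata per object is an illustrative assumption, not a figure from the comment or from TernFS:

```go
package main

import "fmt"

// Rough scale of the metadata problem described above: an exabyte of
// 1 KiB files. The per-object metadata size is an assumed round number.
func main() {
	const (
		capacity    = 1 << 60 // 1 EiB in bytes
		avgFileSize = 1 << 10 // 1 KiB average file size
		metaPerObj  = 128     // assumed metadata bytes per object
	)
	objects := float64(capacity) / avgFileSize
	metaBytes := objects * metaPerObj
	fmt.Printf("objects:  %.0e\n", objects)               // ~1e+15, a quadrillion
	fmt.Printf("metadata: %.0f PiB\n", metaBytes/(1<<50)) // 128 PiB, far beyond RAM
}
```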

heipei · 1h ago
Yeah, that was the first thing I checked as well. Being suited to small / tiny files is a great property of SeaweedFS.
pandemic_region · 1h ago
What happens if you put a tiny file on it then? Bad perf, possible file corruption, ... ?
jleahy · 1h ago
It's just not optimised for tiny files. It would absolutely work: you could store 100 billion 1 kB files with zero problems (and that is 100 terabytes of data, probably on flash, so no joke). What you can't do is store 1 exabyte of 1 kilobyte files (at least not yet).
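
A quick sanity check of those two numbers (decimal units, as in the comment; a minimal sketch):

```go
package main

import "fmt"

// The two scenarios above: 100 billion 1 kB files is only ~100 TB,
// while an exabyte of 1 kB files means a quadrillion files.
func main() {
	const fileSize = 1e3 // 1 kB, decimal units as in the comment

	files := 100e9 // 100 billion files
	fmt.Printf("100e9 files x 1 kB = %.0f TB\n", files*fileSize/1e12) // 100 TB

	capacity := 1e18 // 1 EB
	fmt.Printf("1 EB / 1 kB = %.0e files\n", capacity/fileSize) // 1e+15 files
}
```
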
redundantly · 1h ago
Probably wasted space and lower performance.
ttfvjktesd · 1h ago
How does TernFS compare to CephFS, and why not CephFS, given that it is also proven at multi-petabyte scale?
rostayob · 1h ago
(Disclaimer: I'm one of the authors of TernFS, and while we evaluated Ceph I am not intimately familiar with it.)

Main factors:

* Ceph stores both metadata and file contents in the same object store (RADOS). TernFS uses a specialized database for metadata that takes advantage of various properties of our datasets (immutable files, few moves between directories, etc.; see the sketch after this list).

* While Ceph is capable of storing PBs, we currently store ~600PB on a single TernFS deployment. Last time we checked, this was an order of magnitude more than even very large Ceph deployments.

* More generally, we wanted a system that we knew we could easily adapt to our needs and, more importantly, quickly fix when something went wrong; we estimated that building something new would be less costly overall than adapting Ceph (or some other open-source solution).
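
As a purely hypothetical illustration of what the immutability in the first bullet buys the metadata layer: if file contents never change after creation, each metadata record can be written exactly once, so the store needs no in-place updates, per-file locking, or versioning. The struct and field names below are invented for illustration and are not TernFS's actual schema:

```go
package main

import "fmt"

// Hypothetical metadata record for an immutable file. Because contents
// never change after creation, every field is written exactly once; the
// store never needs in-place updates, per-file locks, or versioning.
// Illustrative only: this is not TernFS's actual schema.
type FileMeta struct {
	ID        uint64   // file identifier, assigned at creation
	Size      uint64   // final size, known once the file is written
	CreatedNs int64    // creation timestamp (ns), never modified
	BlockIDs  []uint64 // locations of the immutable data blocks
}

func main() {
	f := FileMeta{ID: 42, Size: 2 << 20, CreatedNs: 1700000000000000000,
		BlockIDs: []uint64{7, 8, 9}}
	// Reads need no coordination: once a record is visible, it is final.
	fmt.Printf("file %d: %d bytes in %d blocks\n", f.ID, f.Size, len(f.BlockIDs))
}
```

Write-once records like this are also one plausible reason a specialized metadata database can outperform a general-purpose object store for such a workload.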

mgrandl · 43m ago
There are definitely insanely large Ceph deployments; I have seen hundreds of PBs in production myself. Also, your use case sounds like something that should be quite manageable for Ceph, given the limited metadata activity, which tends to be the main pain point with CephFS.
rostayob · 19m ago
I'm not fully up to date, since we looked into this a few years ago; at the time the CERN deployments of Ceph were cited as particularly large examples, and they topped out at ~30PB.

Also note that when I say "single deployment" I mean that the full storage capacity is not subdivided in any way (i.e. there are no "zones", "realms", or similar concepts). We wanted this to be the case after experiencing significant overhead from having to rebalance different storage buckets (albeit with a different piece of software, not Ceph).

If there are EB-scale Ceph deployments I'd love to hear more about them.

kachapopopow · 30m ago
Ceph is more of a "here's a raw block of data, do whatever the hell you want with it" kind of system; it's not really geared towards immutable data.
bananapub · 31m ago
seems like a colossusly nice design.
VikingCoder · 13m ago
I see what you did there.
nunobrito · 50m ago
Thanks for sharing.
sreekanth850 · 51m ago
Wow, great project.