TernFS – An exabyte scale, multi-region distributed filesystem

71 points by rostayob on 9/18/2025, 2:36:44 PM | xtxmarkets.com

Comments (14)

mrbluecoat · 1h ago
Cool project and kudos for open sourcing it. Noteworthy limitation:

> TernFS should not be used for tiny files — our median file size is 2MB.

jandrewrogers · 20m ago
I have worked on exabyte-scale storage engines; there is a good engineering reason for this type of limitation.

If you have a 1 KiB average file size, then you have quadrillions of metadata objects to search and manage quickly and at fine granularity. The kinds of operations and coordination you need to do with metadata are difficult to achieve reliably when the metadata structure itself is many PB in size. There are many interesting edge cases that show up when you have to do deep paging of this metadata off of storage. Making this not slow requires unorthodox design choices that introduce a lot of complexity. Almost none of the metadata fits in memory, including many parts that conventional architectures assume will easily fit in memory.
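
To make that concrete, here's a rough back-of-envelope in Go (the bytes-of-metadata-per-file figure is an assumption for illustration, not a measurement):

    package main

    import "fmt"

    func main() {
        const (
            totalBytes  = int64(1) << 60 // 1 EiB of file data
            avgFile     = 1 << 10        // 1 KiB average file size
            metaPerFile = 256            // assumed metadata bytes per file (illustrative)
        )
        files := totalBytes / avgFile // number of files, i.e. metadata objects
        metaBytes := files * metaPerFile
        fmt.Printf("objects: %.2e\n", float64(files))                  // ~1.13e+15, i.e. quadrillions
        fmt.Printf("metadata: %.0f PiB\n", float64(metaBytes)/(1<<50)) // 256 PiB, i.e. many PB
    }

Even at a modest 256 bytes of metadata per object, the metadata alone runs to hundreds of PiB, far beyond what fits in memory.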

A mere trillion objects is right around the limit of where the allocators, metadata, etc. can be made to scale with heroic effort before conventional architectures break down and things start to become deeply weird on the software design side. Storage engines need to be reliable, so steering clear of that design frontier makes a lot of sense if you can.

It is possible to break this barrier, but doing so introduces myriad interesting design and computer science problems for which there is little literature.

heipei · 1h ago
Yeah, that was the first thing I checked as well. Being suited to small/tiny files is a great property of SeaweedFS.
pandemic_region · 55m ago
What happens if you put a tiny file on it then? Bad perf, possible file corruption, ... ?
jleahy · 48m ago
It's just not optimised for tiny files. It absolutely would work: you could use it to store 100 billion 1 kB files with zero problems (and that is 100 terabytes of data, probably on flash, so no joke). However, you can't use it to store 1 exabyte of 1-kilobyte files (at least not yet).
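
A quick sanity check on those numbers (plain arithmetic, nothing TernFS-specific):

    package main

    import "fmt"

    func main() {
        files := int64(100_000_000_000) // 100 billion files
        size := int64(1_000)            // 1 kB each
        total := files * size           // total bytes
        fmt.Printf("%.0f TB\n", float64(total)/1e12) // 100 TB
    }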
redundantly · 51m ago
Probably wasted space and lower performance.
nunobrito · 33m ago
Thanks for sharing.
ttfvjktesd · 58m ago
How does TernFS compare to CephFS, and why not use CephFS, since it is also proven at the multi-petabyte range?
rostayob · 49m ago
(Disclaimer: I'm one of the authors of TernFS, and while we evaluated Ceph I am not intimately familiar with it.)

Main factors:

* Ceph stores both metadata and file contents using the same object store (RADOS). TernFS uses a specialized database for metadata which takes advantage of various properties of our datasets (immutable files, few moves between directories, etc.); there's a sketch of what immutability buys after this list.

* While Ceph is capable of storing PBs, we currently store ~600 PB on a single TernFS deployment. Last time we checked, this was an order of magnitude more than even very large Ceph deployments.

* More generally, we wanted a system that we knew we could easily adapt to our needs and, more importantly, quickly fix when something went wrong, and we estimated that building something new would be less costly overall than adapting Ceph (or some other open-source solution).
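
To illustrate what immutability buys, here is a hypothetical sketch in Go; the names and fields are illustrative, not TernFS's actual schema:

    package main

    import "fmt"

    // With immutable files, a record is written once when the file is
    // finalized and never updated in place, so the metadata store never
    // has to coordinate concurrent rewrites of file contents or sizes.
    type FileId uint64
    type SpanId uint64

    type FileRecord struct {
        Size  uint64   // final size, known once the file is written
        Mtime int64    // set once at creation; never rewritten
        Spans []SpanId // IDs of the spans holding the file contents
    }

    func main() {
        metadata := map[FileId]FileRecord{} // stand-in for the real on-disk store
        metadata[42] = FileRecord{
            Size:  2 << 20, // ~2 MiB, around the median file size mentioned above
            Mtime: 1758205004,
            Spans: []SpanId{7, 8},
        }
        fmt.Println(metadata[42].Size) // reads never race with in-place updates
    }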

mgrandl · 26m ago
There are definitely insanely large Ceph deployments. I have seen hundreds of PBs in production myself. Also, your use case sounds like something that should be quite manageable for Ceph, due to limited metadata activity, which tends to be the main pain point with CephFS.
rostayob · 2m ago
I'm not fully up to date, since we looked into this a few years ago. At the time, the CERN deployments of Ceph were cited as particularly large examples, and they topped out at ~30 PB.

Also note that when I say "single deployment" I mean that the full storage capacity is not subdivided in any way (i.e. there are no "zones" or "realms" or similar concepts). We wanted this after running into situations where we had significant overhead from having to rebalance different storage buckets (albeit with a different piece of software, not Ceph).

If there are EB-scale Ceph deployments I'd love to hear more about them.

kachapopopow · 13m ago
Ceph is more of a "here's a raw block of data, do whatever the hell you want with it" kind of system; it's not really good for immutable data.
sreekanth850 · 34m ago
Wow, great project.
bananapub · 15m ago
seems like a colossusly nice design.