TernFS – An exabyte scale, multi-region distributed filesystem

118 points by rostayob | 9/18/2025, 2:36:44 PM | xtxmarkets.com | 30 comments

Comments (30)

eps · 4m ago
That was a good read. Compliments to the chefs.

It'd be helpful to have a couple of usage examples that illustrate common operations, like creating a file or finding and reading one, right after the high-level overview section. Just to get an idea of what happens at the service level in these cases.

bitonico · 1m ago
Yes, that would be very useful. We just didn't get to it, and we didn't want perfect to be the enemy of good, since otherwise we would never have open sourced at all :).

But if we have the time it would definitely be a good addition to the docs.
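
In the meantime, here is a minimal sketch of what the common operations look like from the client side, assuming a TernFS volume mounted at a made-up path (/mnt/ternfs) through the kernel client. It is plain POSIX via Go's standard library, purely illustrative, and doesn't show what the metadata and block services do underneath:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical mount point and directory; any POSIX filesystem mounted
	// there behaves the same from the caller's perspective.
	dir := "/mnt/ternfs/datasets/example"
	if err := os.MkdirAll(dir, 0o755); err != nil {
		log.Fatal(err)
	}

	// Create a file and write it in one go. TernFS files are immutable once
	// written, so the write-once pattern is the natural fit. ~2MB matches the
	// median file size mentioned in the article.
	path := filepath.Join(dir, "ticks-2025-09-18.bin")
	if err := os.WriteFile(path, make([]byte, 2<<20), 0o644); err != nil {
		log.Fatal(err)
	}

	// Find and read files back.
	matches, err := filepath.Glob(filepath.Join(dir, "ticks-*.bin"))
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range matches {
		data, err := os.ReadFile(m)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d bytes\n", m, len(data))
	}
}
```

What proper docs would still need to spell out is which metadata shard and block services get involved at each of these steps.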

rickette · 16m ago
Over 500PB of data, wow. Would love to know how and why "statistical models that produce price forecasts for over 50,000 financial instruments worldwide" require that much storage.

mrbluecoat · 2h ago
Cool project and kudos for open sourcing it. Noteworthy limitation:

> TernFS should not be used for tiny files — our median file size is 2MB.

jandrewrogers · 2h ago
I have worked on exabyte-scale storage engines. There is a good engineering reason for this type of limitation.

If you had a 1 KiB average file size, then you have quadrillions of metadata objects to search and manage quickly and at fine granularity. The kinds of operations and coordination you need to do with metadata are difficult to achieve reliably when the metadata structure itself is many PB in size. Interesting edge cases show up when you have to do deep paging of this metadata off of storage, and making that fast requires unorthodox design choices that introduce a lot of complexity. Almost none of the metadata fits in memory, including many parts that conventional architectures assume will always fit in memory.
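
To make that concrete, a back-of-the-envelope sketch (the per-object metadata footprint is an assumed, purely illustrative number):

```go
package main

import "fmt"

func main() {
	const (
		totalData   = float64(1 << 60) // 1 EiB of file data
		smallFile   = 1 << 10          // 1 KiB average file size
		medianFile  = 2 << 20          // ~2 MiB, roughly TernFS's stated median
		metaPerFile = 256.0            // assumed bytes of metadata per file (illustrative)
	)

	for _, avg := range []float64{smallFile, medianFile} {
		files := totalData / avg
		metaBytes := files * metaPerFile
		fmt.Printf("avg file %9.0f B -> %.2e files, ~%.3g PiB of metadata\n",
			avg, files, metaBytes/(1<<50))
	}
}
```

At 1 KiB per file, an exabyte-scale system holds on the order of 10^15 objects, and even a modest 256 bytes of metadata each puts the metadata alone in the hundreds-of-PiB range, which is exactly the regime described above.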

A mere trillion objects is right around the limit where the allocators, metadata structures, etc. can be made to scale with heroic effort before conventional architectures break down and things start to become deeply weird on the software design side. Storage engines need to be reliable, so staying away from that design frontier makes a lot of sense if you can.

It is possible to break this barrier but it introduces myriad interesting design and computer science problems for which there is little literature.

stuartjohnson12 · 13m ago
This sounds like a fascinating niche piece of technical expertise I would love to hear more about.

What are the biggest challenges in scaling metadata from a trillion to a quadrillion objects?

heipei · 2h ago
Yeah, that was the first thing I checked as well. Being suited for small / tiny files is a great property of the SeaweedFS system.

Eikon · 1h ago
Shameless plug: https://github.com/Barre/ZeroFS

I initially developed it for a use case where I needed to store billions of tiny files, and it only requires a single S3 bucket as infrastructure.

pandemic_region · 2h ago
What happens if you put a tiny file on it then? Bad perf, possible file corruption, ... ?

jleahy · 2h ago
It's just not optimised for tiny files. It would absolutely work: you could use it to store 100 billion 1kB files with zero problems (and that is 100 terabytes of data, probably on flash, so no joke). However, you can't use it to store 1 exabyte of 1 kilobyte files (at least not yet).

redundantly · 2h ago
Probably wasted space and lower performance.

ttfvjktesd · 2h ago
How does TernFS compare to CephFS, and why not CephFS, given that it is also tested at the multi-petabyte range?

rostayob · 2h ago
(Disclaimer: I'm one of the authors of TernFS and while we evaluated Ceph I am not intimately familiar with it)

Main factors:

* Ceph stores both metadata and file contents using the same object store (RADOS). TernFS uses a specialized database for metadata which takes advantage of various properties of our datasets (immutable files, few moves between directories, etc.).

* While Ceph is capable of storing PBs, we currently store ~600PB on a single TernFS deployment. Last time we checked, this was an order of magnitude more than even very large Ceph deployments.

* More generally, we wanted a system that we knew we could easily adapt to our needs and more importantly quickly fix when something went wrong, and we estimated that building out something new rather than adapting Ceph (or some other open source solution) would be less costly overall.

eps · 1m ago
The last point is an often overlooked (or even looked down upon) but extremely important advantage. Having something that you know inside out pays dividends in the long term.

mgrandl · 2h ago
There are definitely insanely large Ceph deployments; I have seen hundreds of PBs in production myself. Also, your use case sounds like something that should be quite manageable for Ceph, due to limited metadata activity, which tends to be the main pain point with CephFS.

rostayob · 1h ago
I'm not fully up to date since we looked into this a few years ago. At the time, the CERN deployments of Ceph were cited as particularly large examples, and they topped out at ~30PB.

Also note that when I say "single deployment" I mean that the full storage capacity is not subdivided in any way (i.e. there are no "zones" or "realms" or similar concepts). We wanted this to be the case after experiencing situations where we had significant overhead due to having to rebalance different storage buckets (albeit with a different piece of software, not Ceph).

If there are EB-scale Ceph deployments I'd love to hear more about them.

mgrandl · 1h ago
There are much larger Ceph clusters, but they are enterprise owned and not really publicly talked about. Sadly I can’t share what I personally worked on.

rostayob · 57m ago
The question is whether there are single Ceph deployments that are that large. I believe Hetzner uses Ceph for its cloud offering, and that's probably very large, but I'd imagine that no single tenant is storing hundreds of PBs in it, so it's very easy to shard across many Ceph instances. In our use case we have a single tenant which stores 100s of PBs (and soon EBs).

kachapopopow · 2h ago
Ceph is more of a "here's a raw block of data, do whatever the hell you want with it" system; it's not really geared towards immutable data.

mgrandl · 1h ago
Well, sure, you would have to enforce immutability on the client side.

jleahy · 46m ago
The seamless realtime intercontinental replication is a key feature for us, maybe the most important single feature, and AFAIK you can’t do that with Ceph (even if Ceph could scale to our original 10 exabyte target in one instance).

mdaniel · 39m ago
GPLv2-or-later, in case you were wondering https://github.com/XTXMarkets/ternfs/blob/7a4e466ac655117d24...

coolspot · 10m ago
> Licensing

> TernFS is Free Software. The default license for TernFS is GPL-2.0-or-later.

> The protocol definitions (go/msgs/), protocol generator (go/bincodegen/) and client library (go/client/, go/core/) are licensed under Apache-2.0 with the LLVM-exception. This license combination is both permissive (similar to MIT or BSD licenses) as well as compatible with all GPL licenses. We have done this to allow people to build their own proprietary client libraries while ensuring we can also freely incorporate them into the GPL v2 licensed Linux kernel.

sreekanth850 · 2h ago
Wow, great project.

nunobrito · 2h ago
Thanks for sharing.

bananapub · 2h ago
seems like a colossusly nice design.

jleahy · 35m ago
could be a tectonic shift in the open source filesystem landscape?

VikingCoder · 1h ago
I see what you did there.

eigenvalue · 1h ago
With its focus on immutability and redundancy, this sounds like it would be a good underpinning for a decentralized blockchain file storage system.

mrtesthah · 1h ago
And yet no one needed a blockchain to implement this.