The future of large files in Git is Git

196 thcipriani 87 8/15/2025, 8:07:06 PM tylercipriani.com ↗

bob1029 · 2h ago
> Large object promisors are special Git remotes that only house large files.

I like this approach. If I could configure my repos to use something like S3, I would switch away from using LFS. S3 seems like a really good synergy for large blobs in a VCS. The intelligent tiering feature can move data into colder tiers of storage as history naturally accumulates and old things are forgotten. I wouldn't mind a historical checkout taking half a day (i.e., restored from a robotic tape library) if I am pulling in stuff from a decade ago.

a_t48 · 1h ago
At my current job I've started caching all of our LFS objects in a bucket, for cost reasons. Every time a PR is run, I get the list of objects via `git lfs ls-files`, sync them from gcp, run `git lfs checkout` to actually populate the repo from the object store, and then `git lfs pull` to pick up anything not cached. If there were uncached objects, I push them back up via `gcloud storage rsync`. Simple, doesn't require any configuration for developers (who only ever have to pull new objects), keeps the Github UI unconfused about the state of the repo.

I'd initially looked at spinning up an LFS backend, but this solves the main pain point for now. Github was charging us an arm and a leg for pulling LFS files in CI: each checkout is fresh and the caching model is non-ideal (max 10GB cache, impossible to share between branches), so we end up pulling a bunch of data that is unfortunately in LFS on every commit, possibly multiple times. They happily charge us for all that bandwidth and don't provide tools to make it easy to reduce it (let me pay for more cache size, or warm workers with an entire cache disc, or better cache control, or...).

...and if I want to enable this for developers it's relatively easy, just add a new git hook to do the same set of operations locally.

nullwarp · 1h ago
Same, and I never understood why it wasn't the default from the get-go, but maybe the fit wasn't so obvious when it first came out.

I run a small git LFS server because of this and will be happy to switch away the second I can get git to natively support S3.

jauer · 4h ago
TFA asserts that Git LFS is bad for several reasons, including that it's proprietary with vendor lock-in, which I don't think is a fair claim. GitHub provided an open client and server, which negates that.

LFS does break disconnected/offline/sneakernet operations which wasn't mentioned and is not awesome, but those are niche workflows. It sounds like that would also be broken with promisors.

The `git partial clone` examples are cool!
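For readers who haven't tried them, here's a self-contained sketch of what blob-size filtering does (throwaway local repo; file names and sizes are made up for illustration):

```shell
set -e
# Create a throwaway "server" repo with two revisions of a large binary,
# then partial-clone it, asking the server to withhold blobs over 100 KB.
tmp=$(mktemp -d) && cd "$tmp"
git init -q server && cd server
git config uploadpack.allowFilter true          # let clients request object filters
git config uploadpack.allowAnySHA1InWant true   # let clients lazily fetch blobs by OID
head -c 200000 /dev/urandom > big.bin           # revision 1 of the asset
git add big.bin
git -c user.name=demo -c user.email=demo@example.com commit -qm 'big v1'
head -c 200000 /dev/urandom > big.bin           # revision 2 replaces it
git -c user.name=demo -c user.email=demo@example.com commit -qam 'big v2'
cd ..
git clone -q --filter=blob:limit=100k "file://$tmp/server" partial
# The checkout still contains the current big.bin (fetched on demand),
# but historical oversized blobs were never downloaded; this lists
# what was left behind on the server (here: the old v1):
git -C partial rev-list --objects --missing=print HEAD | grep '^?'
```

Note the filter only skips history: the working tree after the clone is still complete, because anything needed for the checkout gets fetched on demand.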

The description of Large Object Promisors makes it sound like they take the client-side complexity in LFS, move it server-side, and then increase it further? Instead of the client uploading to a git server and to an LFS server, it uploads to a git server which in turn uploads to an object store, while the client downloads directly from the object store? Obviously different tradeoffs there. I'm curious how often people will get bit by uploading to public git servers which upload to hidden promisor remotes.

IshKebab · 4h ago
LFS is bad. The server implementations suck. It conflates object contents with the storage method. It's opt-in, in a terrible way - if you do the obvious thing you get tiny text files instead of the files you actually want.

I dunno if their solution is any better but it's fairly unarguable that LFS is bad.

jayd16 · 2h ago
It does seem like this proposal has exactly the same issue. Unless this new method blocks cloning when unable to access the promisors, you'll end up with similar problems of broken large files.
cowsandmilk · 1h ago
How so? This proposal doesn’t require you to run `git lfs install` to get the correct files…
jayd16 · 30m ago
If the architecture is irrelevant and it's just a matter of turning it on by default they could have done that with LFS long ago.
AceJohnny2 · 1h ago
Another way that LFS is bad, as I recently discovered, is that the migration will pollute the `.gitattributes` of ancestor commits that do not contain the LFS objects.

In other words, if you migrate a repo that has commits A->B->C, and C adds the large files, then commits A & B will gain a `.gitattributes` referring to the large files that do not exist in A & B.

This is because the migration function carries its `.gitattributes` structure backwards as it walks the history, for caching purposes, without cross-referencing it against the current commit.

cma · 3h ago
Git LFS didn't work with SSH; you had to get an SSL cert, which github knew was a barrier for people self-hosting at home. I think gitlab finally got it patched for SSH though.
remram · 1h ago
letsencrypt launched 3 years before git-lfs
Ferret7446 · 1h ago
This article treats LFS unfairly. It does not in any way lock you in to GitHub; the protocol is open. The downsides of LFS are unavoidable as a Git extension. Promisors are basically the same concept as LFS, except as it's built into Git it is able to provide a better UX than is possible as an extension.
andrewmcwatters · 8m ago
Using LFS once in a repository locks you in permanently. You actually have to delete the repository from GitHub to remove the space consumed. It’s entirely a non-starter.

Nowhere is this behavior explicitly stated.

I used to use Git LFS on GitHub to do my company’s study on GitHub statistics because we stored large compressed databases on users and repositories.

throwaway290 · 4m ago
This conflates Git and Github. Github is crap, news at 11. Git itself is fine and LFS is an extension for Git. There is nothing in LFS spec that discusses storage billing. Anyone can write a better server
andrewmcwatters · 24s ago
It does because overwhelmingly the usage of Git is through GitHub. Everything else is practically a rounding error.
glitchc · 4h ago
No. This is not a solution.

While git LFS is just a kludge for now, writing a filter argument during the clone operation is not the long-term solution either.

Git clone is the very first command most people will run when learning how to use git. Emphasized for effect: the very first command.

Will they remember to write the filter? Maybe, if the tutorial to the cool codebase they're trying to access mentions it. Maybe not. What happens if they don't? It may take a long time without any obvious indication. And if they do? The cloned repo might not be compilable/usable since the blobs are missing.

Say they do get it right. Will they understand it? Most likely not. We are exposing the inner workings of git on the very first command they learn. What's a blob? Why do I need to filter on it? Where are blobs stored? It's classic abstraction leakage.

This is a solved problem: Rsync does it. Just port the bloody implementation over. It does mean supporting alternative representations or moving away from blobs altogether, which git maintainers seem unwilling to do.

IshKebab · 4h ago
I totally agree. This follows a long tradition of Git "fixing" things by adding a flag that 99% of users won't ever discover. They never fix the defaults.

And yes, you can fix defaults without breaking backwards compatibility.

Jenk · 3h ago
> They never fix the defaults

Not strictly true. They did change the default push behaviour from "matching" to "simple" in Git 2.0.

hinkley · 3h ago
So what was the second time the stopped watch was right?

I agree with GP. The git community is very fond of doing checkbox fixes for team problems that aren’t or can’t be set as defaults and so require constant user intervention to work. See also some of the sparse checkout systems and adding notes to commits after the fact. They only work if you turn every pull and push into a flurry of activity. Which means they will never work from your IDE. Those are non fixes that pollute the space for actual fixes.

ks2048 · 3h ago
> This is a solved problem: Rsync does it.

Can you explain what the solution is? I don't mean the details of the rsync algorithm, but rather what it would look like from the users' perspective. What files are on your local filesystem when you do a "git clone"?

hinkley · 3h ago
When you do a shallow clone, no historical files would be present. However, when doing a full clone you get a full copy of each version of each blob, and what is being suggested is to treat each revision as an rsync operation upon the last. And the more you muck with a file (which happens a lot, both with assets and if you check in your deps for exact snapshotting of code), the more big-file churn you get.
TGower · 3h ago
> The cloned repo might not be compilable/usable since the blobs are missing.

Only the histories of the blobs are filtered out.

spyrja · 3h ago
Would it be incorrect to say that most of the bloat relates to historical revisions? If so, maybe an rsync-like behavior starting with the most current version of the files would be the best starting point. (Which is all most people will need anyhow.)
pizza234 · 3h ago
> Would it be incorrect to say that most of the bloat relates to historical revisions?

Based on my experience (YMMV), I think it is incorrect, yes, because any time I've performed a shallow clone of a repository, the saving wasn't as much as one would intuitively imagine (in other words: history is stored very efficiently).

spyrja · 2h ago
Doing a bit of digging seems to confirm that, considering that git actually removes a lot of redundancy during packing and garbage collection. It does, however, store each revision as a complete snapshot (loose objects are whole files, unlike a delta-first VCS like Mercurial, though packfiles do delta-compress under the hood), so it still might benefit from a download-the-current-snapshot-first approach.
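A quick way to check this locally: loose objects are whole snapshots, but once `git gc` packs them, similar blobs get delta-compressed. A sketch with a throwaway repo:

```shell
set -e
# Commit two near-identical large-ish text files, then pack the repo
# and inspect the pack for delta chains.
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
seq 1 10000 > data.txt
git add data.txt
git -c user.name=demo -c user.email=demo@example.com commit -qm 'v1'
seq 1 10001 > data.txt                          # tiny change, new blob
git -c user.name=demo -c user.email=demo@example.com commit -qam 'v2'
git gc --quiet                                  # repack everything
# The summary at the end reports delta chains, showing one blob
# is stored as a delta against the other:
git verify-pack -v .git/objects/pack/*.idx | tail -n 3
```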
matheusmoreira · 3h ago
It is a solution. The fact beginners might not understand it doesn't really matter, solutions need not perish on that alone. Clone is a command people usually run once while setting up a repository. Maybe the case could be made that this behavior should be the default and that full clones should be opt-in but that's a separate issue.
gschoeni · 57m ago
We're working on `oxen` to solve a lot of the problems we ran into with git or git-lfs.

We have an open source CLI and server that mirrors git, but handles large files and monorepos with millions of files in a much more performant manner. Would love feedback if you want to check it out!

https://github.com/Oxen-AI/Oxen

technoweenie · 2h ago
I'm really happy to see large file support in Git core. Any external solution would have similar opt-in procedures. I really wanted it to work seamlessly with as few extra commands as possible, so the API was constrained to the smudge and clean filters in the '.gitattributes' file.

Though I did work hard to remove any vendor lock-in by working directly with Atlassian and Microsoft pretty early in the process. It was a great working relationship, with a lot of help from Atlassian in particular on the file locking API. LFS shipped open source with compatible support in 3 separate git hosts.

jameshart · 3h ago
Nit:

> if I git clone a repo with many revisions of a noisome 25 MB PNG file

FYI ‘noisome’ is not a synonym for ‘noisy’ - it’s more of a synonym for ‘noxious’; it means something smells bad.

williadc · 2h ago
I believe that was the author's intent.
jayd16 · 1h ago
How about git just fixes shallow clones and partial clones? Then we don't need convoluted workarounds to cheat in large content after we fully clone a history of pointers or promises or whatever. You should be able to set default clone depth by file type and size in the git attributes (or maybe a file that can also live above a repo, like supporting attributes in .gitconfig locations?).

Then the usual settings would be to shallow clone the latest content as well as fetch the full history and maybe the text file historical content. Ideally you could prune to the clone depth settings as well.

Why are we still talking about large file pointers? If you fix shallow and partial clones, then any repo can be an efficient file mirror, right?
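For what it's worth, the coarse-grained building blocks do exist today; a shallow clone that keeps only the newest commit can be sketched as (throwaway local repo for illustration):

```shell
set -e
# Build a "server" repo with three commits, then shallow-clone just the tip.
tmp=$(mktemp -d) && cd "$tmp"
git init -q server && cd server
for i in 1 2 3; do
  echo "rev $i" > file.txt
  git add file.txt
  git -c user.name=demo -c user.email=demo@example.com commit -qm "rev $i"
done
cd ..
git clone -q --depth=1 "file://$tmp/server" shallow
git -C shallow rev-list --count HEAD    # only the newest commit came over
```

The per-file-type and per-size defaults in the git attributes asked for above don't exist today as far as I know; `--depth` and `--filter` apply to the clone as a whole.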

captn3m0 · 20m ago
Partial clones are also dependent on the server side supporting this. GitHub is one of the very few that does. git.kernel.org, for example, did not, last I checked.
bahmboo · 3h ago
I'm just dipping my toe into Data Version Control - DVC. It is aimed towards data science and large digital asset management using configurable storage sources under a git meta layer. The goal is separation of concerns: git is used for versioning and the storage layers are dumb storage.

Does anyone have feedback about personally using DVC vs LFS?

memmel · 25m ago
I'm in the same boat - I decided this week for DVC over LFS.

For me, the deciding factor was that with LFS, if you want to delete objects from storage, you have to rewrite git history. At least, that's what both the Github and Gitlab docs specify.

DVC adds a layer of indirection, so that its structure is not directly tied to git. If I change my mind and delete the objects from S3, dvc might stop working, but git will be fine.

Some extra pluses about DVC:

- It can point to versioned S3 objects that you might already have as part of existing data pipelines.
- It integrates with the Python fsspec library to read the files on demand using paths like "dvc://path/to/file.parquet". This feels nicer than needing to download all the files up front.

Evidlo · 1h ago
I did a simple test tracking a few hundred gigs of random /dev/urandom data. LFS choked on upload speed while DVC worked fine. My team is using DVC now.
bokchoi · 2h ago
It sounds like git-annex might be a good option for you. There is also https://www.datalad.org/ built on top of git-annex for large data management.
goneri · 3h ago
git-annex is a good alternative to GitHub's solution, and it supports different storage backends. I'm actually surprised it's not more popular.

HexDecOctBin · 3h ago
So this filter argument will reduce the repo size when cloning, but how will one reduce the repo size after a long stint of local commits of changing binary assets? Delete the repo and clone again?
viraptor · 2h ago
It's really not clear which behaviour you want though. For example when you do lots of bisects you probably want to keep everything downloaded locally. If you're just working on new things, you may want to prune the old blobs. This information only exists in your head though.
HexDecOctBin · 2h ago
The ideal behaviour is to have a filter on push too, meaning that files above a certain size get dropped from non-latest history after push.
viraptor · 1h ago
That would prevent old revisions from working... Why would that be ideal?
firesteelrain · 2h ago
Yes once it gets bad enough your only option is to abandon and move the source code only. Your old repo has the history pre abandon.
actinium226 · 2h ago
For lots of local edits you can squash commits using the rebase command with the interactive flag.
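A runnable sketch of that squash (the todo-list edit is scripted via `GIT_SEQUENCE_EDITOR` so it works non-interactively; normally you'd just run `git rebase -i` and edit the list by hand):

```shell
set -e
# Make four commits that each rewrite a binary asset, then fold the
# last three into one with an interactive rebase.
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
for i in 1 2 3 4; do
  head -c 1000 /dev/urandom > asset.bin
  git add asset.bin
  git -c user.name=demo -c user.email=demo@example.com commit -qm "asset rev $i"
done
# Mark every todo entry after the first as "fixup" (melds it into the
# previous commit, discarding its message):
GIT_SEQUENCE_EDITOR="sed -i '2,\$s/^pick/fixup/'" \
  git -c user.name=demo -c user.email=demo@example.com rebase -i HEAD~3
git rev-list --count HEAD    # four commits are now two
```

One caveat: the squashed-away blobs still sit in `.git` until the reflog expires and `git gc` prunes them, so squashing alone doesn't immediately reclaim disk space.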
reactordev · 3h ago
yeah, this isn't really solving the problem, just punting it. While I welcome a short-circuit filter, I see dragons ahead. Dependencies, assets, models... won't benefit at all, since those repos need the large files; that's why the large files are there.
rezonant · 2h ago
There seems to be a misunderstanding. The --filter option simply doesn't populate content in the .git directory that isn't required for the checkout. If a large file is needed for the current checkout (i.e., the part outside the .git folder), it will be fetched regardless of the filter option.

To put it another way, regardless of what max size you give to --filter, you will end up with a complete git checkout, no missing files.

actuallyalys · 1h ago
It’s definitely not a full solution, but it seems like it would solve cases where having the full history of the large files available, just not on everyone’s machine, is the desired behavior.
kerneltime · 1h ago
https://github.com/oneconcern/datamon is a git-for-data tool I wrote a few years back (works with GCS but can be made to work with S3):

1. No server side
2. Immutable data (via GCS policies)
3. Ability to mount data sets as filesystems
4. Integrated with k8s

It was built to work for the needs of the startup funding it, but I would love it if it could be extended.
mathi0750 · 1h ago
Have you tried Oxen.ai? they are doing more fine-tuning and inference now but they have an open-source data version control platform written in rust at the core of their product.
tombert · 4h ago
Is Git ever going to get proper support for binary files?

I’ve never used it for anything serious but my understanding is that Mercurial handles binary files better? Like it supports binary diffs if I understand correctly.

Any reason Git couldn’t get that?

brucehoult · 23m ago
All files in git are binary files.

All deltas between versions are binary diffs.

Git has always handled large (including large binary) files just fine.

What it doesn't like is files where a conceptually minor change changes the entire file, for example compressed or encrypted files.

The only somewhat valid complaint is that if someone once committed a large file and then it was later deleted (maybe minutes later, maybe years later) then it is in the repo and in everyone's checkouts forever. Which applies equally to small and to large files, but large ones have more impact.

That's the whole point of a version control system. To preserve the history, allowing earlier versions to be recreated.

The better solution would be to have better review of changes pushed to the master repo, including having unreviewed changes in separate, potentially sacrificial, repos until approved.

firesteelrain · 2h ago
A lot of people use Perforce Helix and others use Plastic SCM. That's been my experience for large binary assets with git-like functionality.
tom_ · 55m ago
I didn't enjoy using Plastic, but Perforce is OK (not to say it's perfect; I miss a lot of git stuff). It has no problems with lots of data though! This article moans about the overhead of a 25 MB png file... it's been a long time since I worked on a serious project where the head revision is less than 25 GB. Typical daily churn would be 2.5 GB+.

(It's been even longer since I used svn in anger, but maybe it could work too. It has file locking, and local storage cost is proportional to the size of the head revision. It was manageable enough with a 2 GB head revision. Metadata access speed was always terrible though, which was tedious.)

firesteelrain · 37m ago
SVN should be able to handle large files no issue imho
ks2048 · 3h ago
I'm not sure binary diffs are the problem - e.g. for storing images or MP3s, binary diffs are usually worse than nothing.
digikata · 3h ago
I would think that git would need a parallel storage scheme for binaries. Something that does binary chunking and deduplication between revisions, but keeps the same merkle referencing scheme as everything else.
tempay · 3h ago
> binary chunking and deduplication

Are there many binaries that people would store in git where this would actually help? I assume most files end up with compression or some other form of randomization between revisions making deduplication futile.

adastra22 · 1h ago
A lot in the game and visual art industries.
digikata · 3h ago
I don't know, it's all probability in the dataset that makes one optimization strategy better over another. Git annex iirc does file level dedupe. That would take care of most of the problem if you're storing binaries that are compressed or encrypted. It's a lot of work to go beyond that, and probably one reason no one has bothered with git yet. But borg and restic both do chunked dedupe I think.
hinkley · 3h ago
It would likely require more tooling.
a_t48 · 1h ago
The real GH LFS cost is not the storage but the bandwidth on pulling objects down for every fresh clone. $$$$$. See my other comment. :)
nixpulvis · 2h ago
I was just using git LFS and was very concerned with how bad the help message was compared to the rest of git. I know it seems small, but it just never felt like a team player, and now I'm very happy to hear this.
anon-3988 · 1h ago
What prevents Git from simply working better with large files?
AceJohnny2 · 1h ago
git works just fine with large files. The problem is that when you clone a repo, or pull, by default it gets everything, including large files deep in the history that you probably don't care about anymore.

That was actually an initial selling point of git: you have the full history locally. You can work from the plane/train/deserted island just fine.

These large files will persist in the repo forever. So people look for options to segregate large files out so that they only get downloaded on demand (aka "lazily").

All the existing options (submodules, LFS, partial clones) are different answers to "how do we make certain files only download on demand"

anon-3988 · 54m ago
IIRC, it takes ages to index a large folder. I was trying to use it to store diffs of my backup folder, which constantly gets rclone'd and rsync'd over, in case those ever fucked up catastrophically.
als0 · 4h ago
10 years late is better than never.
Affric · 4h ago
Incredible.

Nice to see some Microsoft and Google emails contributing.

matheusmoreira · 4h ago
As it should be! If it's not native to git, it's not worth using. I'm glad these issues are finally being solved.

These new features are pretty awesome too. Especially separate large object remotes. They will probably enable git to be used for even more things than it's already being used for. They will enable new ways to work with git.

jiggawatts · 3h ago
What I would love to see in an SCM that properly supports large binary blobs is storing the contents using Prolly trees instead of a simple SHA hash.

Prolly trees are very similar to Merkle trees or the rsync algorithm, but they support mutation and version history retention with some nice properties. For example: you always obtain exactly the same tree (with the same root hash) irrespective of the order of incremental edit operations used to get to the same state.

In other words, two users could edit a subset of a 1 TB file, both could merge their edits, and both will then agree on the root hash without having to re-hash or even download the entire file!

Another major advantage on modern many-core CPUs is that Prolly trees can be constructed in parallel instead of having to be streamed sequentially on one thread.

Then the really big brained move is to store the entire SCM repo as a single Prolly tree for efficient incremental downloads, merges, or whatever. I.e.: a repo fork could share storage with the original not just up to the point-in-time of the fork, but all future changes too.

hinkley · 3h ago
Git has had a good run. Maybe it’s time for a new system built by someone who learned about DX early in their career, instead of via their own bug database.

If there’s a new algorithm out there that warrants a look…

viraptor · 2h ago
Jujutsu unfortunately doesn't have any story for large files yet (as far as I can tell), but maybe soon ...
sublinear · 3h ago
May I humbly suggest that those files probably belong in an LFS submodule called "assets" or "vendor"?

Then you can clone without checking out all the unnecessary large files and still get a working build. This also helps on the legal side to correctly license your repos.

I'm struggling to see how this is a problem with git and not just antipatterns that arise from badly organized projects.

charcircuit · 3h ago
The user shouldn't have to think about such a thing. Version control should handle everything automatically and not force the user into doing extra work to workaround issues.
hinkley · 3h ago
I always hated the “write your code like the next maintainer is a psychopath” mantra because it makes the goal unclear. I prefer the following:

Write your code/tools as if they will be used at 2:00 am while the server room is on fire. Because sooner or later they will be.

A lot of our processes are used like emergency procedures. Emergency procedures are meant to be brainless as much as possible. So you can reserve the rest of your capacity for the actual problem. My version essentially calls out Kernighan’s Law.

sublinear · 2h ago
Organizing your files sensibly is not necessary to use LFS nor is it a "workaround". It's just a pattern I am suggesting to make life easier regardless of what tools you decide to use. I can't think of a case where organizing your project to fail gracefully is a bad idea.

Git does the responsible thing and lets the user determine how to proceed with the mess they've made.

I must say I'm increasingly suspicious of the hate that git receives these days.

forrestthewoods · 3h ago
Git is fundamentally broken and bad. Almost all projects are defacto centralized. Your project is not Linux.

A good version control system would support petabyte scale history and terabyte scale clones via sparse virtual filesystem.

Git’s design is just bad for almost all projects that aren’t Linux.

(I know this will get downvoted. But most modern programmers have never used anything but Git and so they don’t realize their tool is actually quite bad! It’s a shame.)

codethief · 3h ago
> A good version control system would support petabyte scale history and terabyte scale clones via sparse virtual filesystem.

I like this idea in principle but I always wonder what that would look in practice, outside a FAANG company: How do you ensure the virtual file system works equally well on all platforms, without root access, possibly even inside containers? How do you ensure it's fast? What do you do in case of network errors?

forrestthewoods · 2h ago
Someone just needs to do it. Numerous companies have built their own cross-platform VFS layers. It's hard but not intractable.

Re network errors. How many things break when GitHub is down? Quite a lot! This isn’t particularly special. Prefetch and clone are the same operation.

DonHopkins · 2h ago
NFS server not responding. Still trying...

Tom Lyon: NFS Must Die! From NLUUG 2024:

https://www.youtube.com/watch?v=ZVF_djcccKc

>Why NFS must die, and how to get Beyond File Sharing in the cloud.

Slides:

https://nluug.nl/bestanden/presentaties/2024-05-21-tom-lyon-...

Eminent Sun alumnus says NFS must die:

https://blocksandfiles.com/2024/06/17/eminent-sun-alumnus-sa...

ants_everywhere · 3h ago
Yeah we're at the CVS stage where everyone uses it because everyone uses it.

But most people don't need most of its features and many people need features it doesn't have.

If you look up git worktrees, you'll find a lot of blog articles referring to worktrees as a "secret weapon" or similar. So git's secret weapon is a mode that lets you work around the ugliness of branches. This suggests that many people would be better suited by an SCM that isn't branch-based.

It's nice having the full history offline. But the scaling problems force people to adopt a workflow where they have a large number of small git repos instead of keeping the history of related things together. I think there are better designs out there for the typical open source project.

matheusmoreira · 29m ago
I don't understand what you mean by "the ugliness of branches".

In my experience, branches are totally awesome. Worktrees make branches even more awesome because they let me check out multiple branches at once to separate directories.

The only way it could get better is if it somehow gains the ability to check out the same branch to multiple different directories at once.
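For anyone who hasn't tried worktrees, a minimal sketch (throwaway repo; the branch name is made up):

```shell
set -e
# One repository, two branches checked out side by side via worktrees.
tmp=$(mktemp -d) && cd "$tmp"
git init -q project && cd project
echo hello > README
git add README
git -c user.name=demo -c user.email=demo@example.com commit -qm 'initial'
git branch feature
git worktree add ../feature-wt feature    # second working directory
git worktree list                         # shows both checkouts
```

As for the same branch in multiple directories: git currently refuses to check out a branch that is already checked out in another worktree, which is exactly the limitation lamented above.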

DonHopkins · 2h ago
Git now has artificial feet to aim the foot guns at so they hit the right target.
matheusmoreira · 3h ago
Completely disagree. Git is fundamentally functional and good. All projects are local and decentralized, and any "centralization" is in fact just git hosting services, of which there are many options which are not even mutually exclusive.
compiler-guy · 1h ago
Git works fine and is solid and well-enough known to be a reasonable choice for most people.

But I encourage everyone to try out a few alternatives (and adopt their workflows at least for a while). I have no idea if you have or not.

But if one has never used the alternatives, one doesn't really know just how nice things can be. And even if you still find Git to be your preferred tool, having an alternative experience can open you to other possibilities and ways of working.

Just like everyone should try a couple of different programming languages or editors or anything else for size. You may not end up choosing it, but seeing the possibilities and different ways of thinking is a very good thing.

the_arun · 2h ago
Are you missing that central hosting services provide a good backup plan for your locally hosted git?
matheusmoreira · 2h ago
I agree! They are excellent git backup services. I use several of them: github, codeberg, gitlab, sourcehut. I can easily set up remotes to push to all of them at once. I also have copies of my important repositories on all my personal computers, including my phone.

This is only possible because git is decentralized. Claiming that git is centralized is complete falsehood.

firesteelrain · 2h ago
We had a repo that was at one point 25GB. It had Git LFS turned on but the files weren’t stored outside of BitBucket. Whenever a build was run in Bamboo, it would choke big time.

We found that we could move the large files to Artifactory as it has Git LFS support.

But the problem was the entire history that did not have Artifactory pointers. Every clone included the large files (for some reason the filter functionality wouldn't work for us; it was a large repo and it had hundreds of users, amongst other problems).

Anyways what we ended up doing was closing that repo and opening a new one with the large files stripped.

Nitpick on the author's page:

“Nowadays, there’s a free tier, but you’re dependent on the whims of GitHub to set pricing. Today, a 50GB repo on GitHub will cost $40/year for storage”

This is not true as you don’t need GitHub to get LFS support

whatever1 · 1h ago
It is insane that after almost a century of running computations on data, we still don't have a good version control system that maps a code version to its relevant data version.

Still the approach is to put code and data in a folder and call it a day. Slap a "_FINAL" at the folder name and you are golden.
