I wrote git-bigstore [0] almost 10 (!) years ago to solve this problem—even before Git LFS—and as far as I know, bigstore still works perfectly.
You specify the files you want to store in your storage backend via .gitattributes, and use two separate commands to sync files. I have not touched this code in years but the general implementation should still work.
GitHub launched LFS not too long after I wrote this, so I kind of gave up on the idea thinking that no one would want to use it in lieu of GitHub's solution, but based on the comments I think there's a place for it.
It needs some love but the idea is solid. I wrote a little description on the wiki about the low-level implementation if you want to check it out. [1]
Also, all of the metadata is stored using git notes, so it's completely portable and frontend agnostic—it doesn't lock you into anything (except, of course, the storage backend you use).
> Large object promisors are special Git remotes that only house large files.
I like this approach. If I could configure my repos to use something like S3, I would switch away from using LFS. S3 seems like a really good fit for large blobs in a VCS. The intelligent tiering feature can move data into colder tiers of storage as history naturally accumulates and old things are forgotten. I wouldn't mind a historical checkout taking half a day (e.g., restored from a robotic tape library) if I am pulling in stuff from a decade ago.
riedel · 10h ago
The article mentions alternatives to git lfs like git-annex that already support S3 (which IMHO is still a bit of a pain in the ass on Windows due to the symlink workflow). Also dvc plays nicely with git and S3. Gitlab btw also simply offloads git lfs to S3. All have their quirks. I typically opt for LFS as a no-brainer but use the others when it fits the workflow and the infrastructure requirements.
Edit: In particular, the hash algorithm and the change detection (and when it happens) make a difference if you have 2 GB files and not only the 25 MB file from the OP.
a_t48 · 19h ago
At my current job I've started caching all of our LFS objects in a bucket, for cost reasons. Every time a PR is run, I get the list of objects via `git lfs ls-files`, sync them from gcp, run `git lfs checkout` to actually populate the repo from the object store, and then `git lfs pull` to pick up anything not cached. If there were uncached objects, I push them back up via `gcloud storage rsync`. Simple, doesn't require any configuration for developers (who only ever have to pull new objects), keeps the Github UI unconfused about the state of the repo.
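Roughly, the per-PR flow looks like this (bucket name and layout are placeholders, not the real setup):

```sh
# Pull whatever the bucket already has into the local LFS object store...
gcloud storage rsync --recursive gs://my-lfs-cache/lfs-objects .git/lfs/objects
# ...materialize the working tree from those cached objects...
git lfs checkout
# ...fetch anything still missing from the real LFS endpoint...
git lfs pull
# ...and push newly fetched objects back up so the next run finds them cached.
gcloud storage rsync --recursive .git/lfs/objects gs://my-lfs-cache/lfs-objects
```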
I'd initially looked at spinning up an LFS backend, but this solves the main pain point, for now. GitHub was charging us an arm and a leg for pulling LFS files in CI: each checkout is fresh and the caching model is non-ideal (max 10GB cache, impossible to share between branches), so we end up pulling a bunch of data that is unfortunately in LFS on every commit, possibly multiple times. They happily charge us for all that bandwidth and don't provide tools to make it easy to reduce it (let me pay for more cache size, or warm workers with an entire cache disk, or better cache control, or...).
...and if I want to enable this for developers it's relatively easy, just add a new git hook to do the same set of operations locally.
tagraves · 16h ago
We use a somewhat similar approach in RWX when pulling LFS files[1]. We run `git lfs ls-files` to get a list of the LFS files, then pass that list into a task which pulls each file from the LFS endpoint using curl. Since in RWX the outputs of tasks are cached as long as their inputs don't change, the LFS files just stay in the RWX cache and are pulled from there on future clones in CI. In addition to saving on GitHub's LFS bandwidth costs, the RWX cache is also _much_ faster to restore from than `git lfs pull`.
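For anyone curious what "pull with curl" involves: the LFS batch API hands back per-object download URLs. A rough sketch, with host, oid, and size as placeholders and auth omitted:

```sh
# Ask the LFS endpoint for download URLs for the objects we need.
curl -s -X POST "https://github.com/org/repo.git/info/lfs/objects/batch" \
  -H "Accept: application/vnd.git-lfs+json" \
  -H "Content-Type: application/vnd.git-lfs+json" \
  -d '{"operation": "download", "transfers": ["basic"],
       "objects": [{"oid": "deadbeef...", "size": 26214400}]}'
# The response lists a signed href per object, which can then be fetched with a
# plain `curl -o <path> <href>` and cached like any other build artifact.
```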
Nice! I was considering using some sort of pull through cache like this, but went with the solution that didn’t require setting up more infra than a bucket.
gmm1990 · 17h ago
Why not run some open source CI locally, or on the Google equivalent of EC2, if you're already going to the trouble of this much customization with GitHub CI?
a_t48 · 12h ago
It was half a day of work to make a drop in action.yml that does this. Saved a bunch of money (both in bandwidth and builder minutes), well worth the investment. It really wasn’t a lot of customization.
All our builds are on GHA definitions, there’s no way it’s worth it to swap us over to another build system, administer it, etc. Our team is small (two at the time, but hopefully doubling soon!), and there’s barely a dozen people in the whole engineering org. The next hit list item is to move from GH hosted builders to GCE workers to get a warmer docker cache (a bunch of our build time is spent pulling images that haven’t changed) - it will also save a chunk of change (GCE workers are 4x cheaper per minute and the caching will make for faster builds), but the opportunity cost for me tackling that is quite high.
fmbb · 11h ago
Doesn’t the official docker build push action support caching with the GitHub Actions cache?
a_t48 · 12m ago
Yes but one image push for us is >10GB, due to ML dependencies. And even if it is intelligent and supports per layer caching, you can’t share between release branches - https://github.com/docker/build-push-action/issues/862.
And even if that did work, I've found it much more reliable to use the actual Docker BuildX disk state than to try to get caching for complex multi-stage builds working reliably. I have a case right now where there's no combination of --cache-to/--cache-from flags that will give me a 100% cached rebuild starting from a fresh builder, using only remote cache. I should probably report it to the Docker team, but I don't have a minimal repro right now and there's a 10% chance it's actually my fault.
Awesome. Technically you can go over the limit right now (ours was saying 93/10GB last I checked), but I don’t know the eviction policy. I’d rather pay a bit more and know for sure when data will stick around.
nullwarp · 18h ago
Same, and I never understood why it wasn't the default from the get-go, but maybe S3 wasn't so synonymous with object storage when this all first came out.
I run a small git LFS server because of this and will be happy to switch away the second I can get git to natively support S3.
_bent · 6h ago
I'm currently running https://github.com/datopian/giftless to store the LFS files belonging to repos I have on GitHub on my homelab MinIO instance.
There are a couple other projects that bridge S3 and LFS, though I had the most success with this setup.
johnisgood · 12h ago
Is S3 related to Amazon?
bayindirh · 10h ago
You can install your own S3-compatible storage system on premises. It can be anything from a simple daemon (Scality, JuiceFS) to a small appliance (TrueNAS) to a full-blown storage cluster (Ceph). OpenStack has its own object storage service (Swift).
If you fancy it for your datacenter, big players (Fujitsu, Lenovo, Huawei, HPE) will happily sell you "object storage" systems which also support S3 at very high speeds.
yugoslavia4ever · 10h ago
And for CI and local development testing you can use LocalStack, which runs in a Docker container and has implementations of most AWS services.
bayindirh · 10h ago
Oh, that sounds interesting. We don't use AWS, but it's a nice alternative for people using AWS for their underpinnings.
Scality's open source S3 Server also can run in a container.
flohofwoe · 12h ago
Yeah it's AWS's 'cloud storage service'.
dotancohen · 11h ago
It's actually 'Simple Storage Service', hence the acronym S3.
StopDisinfo910 · 9h ago
Yes, S3 is the name of Amazon's object storage service. Various players in the industry have started offering solutions with a compatible API, which some people loosely call S3 too.
jauer · 21h ago
TFA asserts that Git LFS is bad for several reasons, including that it's proprietary with vendor lock-in, which I don't think is a fair claim. GitHub provided an open client and server, which negates that.
LFS does break disconnected/offline/sneakernet operations which wasn't mentioned and is not awesome, but those are niche workflows. It sounds like that would also be broken with promisors.
The `git partial clone` examples are cool!
The description of Large Object Promisors makes it sound like they take the client-side complexity in LFS, move it server-side, and then increase the complexity? Instead of the client uploading to a git server and to an LFS server, it uploads to a git server which in turn uploads to an object store, but the client will download directly from the object store? Obviously different tradeoffs there. I'm curious how often people will get bit by uploading to public git servers which upload to hidden promisor remotes.
IshKebab · 21h ago
LFS is bad. The server implementations suck. It conflates object contents with the storage method. It's opt-in, in a terrible way - if you do the obvious thing you get tiny text files instead of the files you actually want.
I dunno if their solution is any better but it's fairly unarguable that LFS is bad.
jayd16 · 19h ago
It does seem like this proposal has exactly the same issue. Unless this new method blocks cloning when unable to access the promisors, you'll end up with similar problems of broken large files.
cowsandmilk · 18h ago
How so? This proposal doesn’t require you to run `git lfs install` to get the correct files…
jayd16 · 17h ago
If the architecture is irrelevant and it's just a matter of turning it on by default they could have done that with LFS long ago.
thayne · 14h ago
Git lfs can't do it by default because:
1. It is a separate tool that has to be installed separately from git
2. It works by using git filters and git hooks, which need to be set up locally.
Something built in to git doesn't have those problems.
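Concretely, this is the local state `git lfs install` and `git lfs track` have to set up before pointer files resolve to real content (standard locations, trimmed for brevity):

```sh
git lfs install          # registers the clean/smudge filters in ~/.gitconfig
git lfs track "*.psd"    # adds "*.psd filter=lfs diff=lfs merge=lfs -text" to .gitattributes

# ~/.gitconfig ends up with:
#   [filter "lfs"]
#       clean = git-lfs clean -- %f
#       smudge = git-lfs smudge -- %f
#       process = git-lfs filter-process
#       required = true
```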
xg15 · 11h ago
But then they could have just taken the LFS plugin and made it a core part of git, if that were the only problems.
vlovich123 · 14h ago
And what happens when an object is missing from the cloud storage or that storage has been migrated multiple times and someone turns down the old storage that’s needed for archival versions?
atq2119 · 4h ago
You obviously get errors in that case, which is not great.
But GP's point was that there is an entire other category of errors with git-lfs that is eliminated with this more native approach. Git-lfs allows you to get into an inconsistent state, e.g. when you interrupt a git action, in a way that just doesn't happen with native git.
jayd16 · 3h ago
It's yet to be seen what it actually eliminates and what they're willing to actually enable by default.
The architecture does seem to still be in the general framing of "treat large files as special and host them differently." That is the crux of the problem in the first place.
I think it would shock no one to find that the official system also needs to be enabled and also falls back to a mode where it supports fetching and merging pointers without full file content.
I do hope all the UX problems will be fixed. I just don't see them going away naturally and we have to put our trust in the hope that the git maintainers will make enjoyable, seamless and safe commands.
ozim · 12h ago
I think maybe don't store large files in the repo, but manage those separately.
Mostly I have not run into such a use case, but in general I don't see any upside to trying to shove big files in with code in repositories.
tsimionescu · 10h ago
That is a complete no-go for many use cases. Large files can have exactly the same use cases as your code: you need to branch them, you need to know when and why they changed, you need to check how an old build with an old version of the large file worked, etc. Just because code tends to be small doesn't mean that all source files for a real program are going to be small too.
ozim · 7h ago
Yeah but GIT is not the tool for that.
That is why I don’t understand why people „need to use GIT”.
You can still do something else, like keeping versions and keeping track of those versions in many different ways.
You can store a reference in repo like a link or whatever.
da_chicken · 1h ago
A version control system is a tool for managing a project, not exclusively a tool for managing source code.
Wanting to split up the project into multiple storage spaces is inherently hostile to managing the project. People want it together because it's important that it stays together as a basic function of managing a project of digital files. The need to track and maintain digital version numbers and linking them to release numbers and build plans is just a requirement.
That's what actual, real projects demand. Any project that involves digital assets is going to involve binary, often large, data files. Any project that involves large tables of pre-determined or historic data will involve large files, which may be text or binary, containing data the project requires. They won't have everything encompassed by the project as a text file. It's weird when that's true for a project. It's a situation unique to the Linux kernel because it, somewhat uniquely, doesn't have graphics or large, predetermined data blocks. Well, not all projects that need to be managed by git share 100% of the attributes of the Linux kernel.
This idea that everything in a git project must be small text files is incredibly bizarre. Are you making a video game? A website? A web application? A data-driven API? Does it have geographic data? Does it require images? Video? Music or sound? Are you providing static documentation that must be included?
So the choices are:
1. Git is a useful general-purpose VCS for real world projects.
2. Git does not permit binary or large files.
Tracking versioning on large files is not some massively complex problem. Not needing to care about diffing and merging simplifies how those files are managed.
IshKebab · 7h ago
> Yeah but GIT is not the tool for that.
Yes, because Git currently is not good at tracking large files. That's not some fundamental property of Git; it can be improved.
Btw it isn't GIT.
tsimionescu · 6h ago
The important point is that you don't want two separate histories. Maybe if your use case is very heavy on large files, you can choose a different SCM, which is better at this use case (SVN, Perforce). But using different SCMs for different files is a recipe for disaster.
jayd16 · 2h ago
That's pretty much what git LFS is...
mafuy · 7h ago
Git is the right tool. It's just bad at this job.
AceJohnny2 · 18h ago
Another way that LFS is bad, as I recently discovered, is that the migration will pollute the `.gitattributes` of ancestor commits that do not contain the LFS objects.
In other words, if you migrate a repo that has commits A->B->C, and C adds the large files, then commits A & B will gain a `.gitattributes` referring to the large files that do not exist in A & B.
This is because the migration function will carry its .gitattributes structure backwards as it walks the history, for caching purposes, and not cross-reference it against the current commit.
actinium226 · 16h ago
That doesn't sound right. There's no way it's adding a file to previous commits, that would change the hash and thereby break a lot of things.
AceJohnny2 · 16h ago
`git lfs migrate` rewrites the commits to convert large files in the repo to/from LFS pointers, so yes, it does change the hashes. That's a well-documented effect.
Now, granted, usually people run migrate to only convert new local commits, so by nature of the ref include/exclude system it will not touch older commits. But in my case I was converting an entire repo into one using LFS. I hoped it would preserve those commits in a base branch that didn't contain large files, but to my disappointment I got the .gitattributes pollution described above.
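For reference, the whole-repo conversion being described is roughly this (the include patterns are just examples):

```sh
# Rewrite every ref so matching files become LFS pointers throughout history.
git lfs migrate import --include="*.psd,*.zip" --everything
```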
actinium226 · 6h ago
From the documentation, like 2 paragraphs in:
> In all modes, by default git lfs migrate operates only on the currently checked-out branch, and only on files (of any size and type) added in commits which do not exist on any remote. Multiple options are available to override these defaults.
Were your remotes not configured correctly?
AceJohnny2 · 1h ago
Let me repeat myself:
> But in my case I was converting an entire repo into one using LFS.
then check out the section in the manual "INCLUDE AND EXCLUDE REFERENCES"
gradientsrneat · 2h ago
> LFS does break disconnected/offline/sneakernet operations which wasn't mentioned and is not awesome
Yea, I had the same thought. And TBD on large object promisors.
Git annex is somewhat more decentralized as it can track the presence of large files across different remotes. And it can pull large files from filesystem repos such as USB drives. The downside is that it's much more complicated and difficult to use. Some code forges used to support it, but support has since been dropped.
cma · 20h ago
Git LFS didn't work with SSH; you had to get an SSL cert, which GitHub knew was a barrier for people self-hosting at home. I think GitLab finally got it patched for SSH, though.
remram · 18h ago
letsencrypt launched 3 years before git-lfs
IndrekR · 9h ago
Let's Encrypt was founded in 2012, but became available in the wild in December 2015; git-lfs in mid-2014. So the same era in general.
cma · 8h ago
That's already a domain name and a more complicated setup without a public static IP in home environments, and in corporate environments now you're dealing with a whole process etc. that might be easier to get through by... paying for GitHub LFS.
I think it is a much bigger barrier than SSH, and I have seen it be one on short-timeline projects where it's getting set up for the first time: they just end up paying GitHub crazy per-GB costs, or building rat's nests of tunnels and VPN configurations for different repos to keep remote access with encryption, with a whole lot more trouble than just an SSH path.
Sophira · 1h ago
How does this upcoming feature deal with the potential problem of fake commit IDs?
Commit IDs are based on a number of factors about the commit, including the actual contents and the commit ID of the parent commit. Any fully cloned git repository can theoretically be audited to make sure that all its commit IDs are correct. Nobody does this (although perhaps git does automatically?), but it's possible.
But now, picture a git repository that has a one petabyte file in one of its early commits (and deleted again later). Pretty much nobody is going to have the space required to download this, so many people will not even bother to do so. As such, what's to stop the server from just claiming any commit ID it wanted for this particular commit? Who's going to check?
(Bonus: For that matter, is the one petabyte file even real? Or just a faked size in the metadata?)
To be clear, I assume people have already thought about these issues. I'm just curious what the answers are.
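For what it's worth, the audit is easy to run locally, at least over the objects you have actually fetched; a quick sketch:

```sh
# Recompute one commit's ID from its raw content; it should match `git rev-parse HEAD`.
git cat-file commit HEAD | git hash-object -t commit --stdin
# Or let git hash-check and connectivity-check every object it has on disk.
git fsck --full
```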
KronisLV · 10h ago
> And the problems are significant:
> High vendor lock-in – When GitHub wrote Git LFS, the other large file systems—Git Fat, Git Annex, and Git Media—were agnostic about the server-side. But GitHub locked users to their proprietary server implementation and charged folks to use it.
Is this a current issue?
I used Git LFS with a GitLab instance this week, seemed to work fine.
At the same time it feels odd to hear mentions of LFS being deprecated in the future, while I’ve seldom seen anyone even use it - people just don’t seem to care and shove images and such into regular Git which puzzles me.
Ferret7446 · 18h ago
This article treats LFS unfairly. It does not in any way lock you in to GitHub; the protocol is open. The downsides of LFS are unavoidable as a Git extension. Promisors are basically the same concept as LFS, except as it's built into Git it is able to provide a better UX than is possible as an extension.
andrewmcwatters · 17h ago
Using LFS once in a repository locks you in permanently. You actually have to delete the repository from GitHub to remove the space consumed. It’s entirely a non-starter.
Nowhere is this behavior explicitly stated.
I used to use Git LFS on GitHub to do my company’s study on GitHub statistics because we stored large compressed databases on users and repositories.
throwaway290 · 17h ago
This conflates Git and Github. Github is crap, news at 11. Git itself is fine and LFS is an extension for Git. There is nothing in LFS spec that discusses storage billing. Anyone can write a better server
andrewmcwatters · 17h ago
It does because overwhelmingly the usage of Git is through GitHub. Everything else is practically a rounding error. So it’s incredibly helpful to know that the most popular large file retrieval extension to Git on the most popular Git host will lock you in.
integralid · 6h ago
>overwhelmingly the usage of Git is through GitHub. Everything else is practically a rounding error
Is that true? I used git commercially in five companies, and I never used github commercially (except as a platform for projects we opensourced).
You already depend on GitHub if you host your project there. But you're not locked in, because you can just close your GitHub repo and migrate somewhere else. Am I missing something?
bobmcnamara · 5h ago
> But you're not locked in, because you can just close your github repo and migrate somewhere else.
If you used LFS, you have to fork and rewrite your repository to update the .lfsconfig backend URLs to get back to a reasonable working state.
throwaway290 · 16h ago
That's very bad logic. By that logic Git sucks because Github sucks. I cringe every time people conflate it and if you know better then why... Just say Github sucks instead.
flohofwoe · 12h ago
Even ancient svn works much better out of the box for large binary files than git (e.g. a 150 GB working directory with many big Photoshop files and other binary assets is no problem in SVN).
What does SVN do differently than git when it comes to large binary files, and why can't git use the same approach?
I also don't quite understand, tbh, how offloading large files to somewhere else would be fundamentally different from storing all files in one place, except for complicating everything. Storage is storage; how would a different storage location fix any of the current performance and robustness problems? Offloading just sounds like a solution for public git forges which don't want to deal with big files because it's too costly for them, but increased hosting cost is not the 'large binary file problem' of git.
(edit: apparently git supports proper locking(?) so I removed that section - ps: nvm it looks like the file locking feature is only in git-lfs)
pjc50 · 12h ago
Completely different design. Git is intended to be fully distributed, so (a) every repo is supposed to have the full history of every file, and (b) locking is meaningless.
People should use the VCS that's appropriate for their project rather than insist on git everywhere.
dazzawazza · 9h ago
> People should use the VCS that's appropriate for their project rather than insist on git everywhere.
A lot of people don't seem to realise this. I work in game dev and SVN or Perforce are far far better than Git for source control in this space.
In AA game dev a checkout (not the complete history, not the source art files) can easily get to 300GB of binary data. This is really pushing Subversion to its limits.
In AAA gamedev you are looking at a full checkout of the latest assets (not the complete history, not the source art files) of at least 1TB and 2TB is becoming more and more common. The whole repo can easily come in at 100 TB. At this scale Perforce is really the only game in town (and they know this and charge through the nose for it).
In the movie industry you can multiply AAA gamedev by ~10.
Git has no hope of working at this scale as much as I'd like it to.
jayd16 · 2h ago
Perforce gets the job done but it's a major reason why build tooling is worse in games.
Github/gitlab is miles ahead of anything you can get with Perforce. People are not just pushing for git because of its UX; they're pushing git so they can use the ecosystem.
gmokki · 5h ago
I've been thinking of using a git filter to split the huge asset files (which are internally just a collection of assets bundled into 200MB-1GB files) into smaller ones. That way, when an artist modifies one sub-asset in a huge file, only the small change is recorded in history.
There is an example filter for doing this with zip files.
The above should work. But does git support multiple filters for a file? For example, first the asset-split filter above, and then storing the files in LFS, which is another filter.
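For reference, the general wiring for such a filter looks like this; the `assetsplit` driver and its commands are made up for illustration:

```sh
# Register a filter driver: clean runs when files are staged, smudge on checkout.
# Both read the file on stdin and write the transformed version to stdout.
git config filter.assetsplit.clean  "split-assets pack"
git config filter.assetsplit.smudge "split-assets unpack"

# .gitattributes then routes the big bundle files through it:
#   *.bundle filter=assetsplit
```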
dazzawazza · 4h ago
I mean it might work, but you'll still get pull timeouts constantly with LFS. It's expensive to wait two or three days before you can start working on a project. Go away for two weeks and it will take a day before you're pulled up to date.
I hope this "new" system works but I think Perforce is safe for now.
xg15 · 11h ago
> People should use the VCS that's appropriate for their project rather than insist on git everywhere.
Disagree. I really like the "de-facto standard" that git has become for open source. It means if I want to understand some new project's source code, there is one less hassle for me to deal with: I don't need to learn any new concepts just to access the source code and all the tooling is already right there.
The situation we have with package managers, dependency managers and package managers for package managers is bad enough. I really don't want a world in which every language or every project also comes with its own version control system and remote repo infrastructure.
flohofwoe · 11h ago
A "proper" versioning system doesn't need to be learned since you literally only need a handful of straightforward operations (how do I get the latest version? how do I add a file? how do I commit changes?) - in svn that's 'svn update', 'svn add' and 'svn commit', that's all what's needed to get you through the day, no 'push', no 'staging area', no 'fetch' vs 'pull' and the inevitable merge-vs-rebase discussion... etc etc etc...)
It's only git which has this fractal feature set which requires expert knowledge to untangle.
xg15 · 11h ago
But if all systems are so similar anyway, why would you need "the right tool for the job"?
If nothing else, you have to install it. There will also be subtle differences between concepts, e.g. git and svn both have versions and branches, but the concepts behave differently. I don't know about Mercurial, but I'm sure they have their own quirks as well.
Also, tooling: I have a VSCode plugin that visualizes the entire graph structure of a git repo really nicely. Right now, I can use that on 99% of all repos to get an overview of the branches, last commits, activity, etc.
If version systems were fragmented, I'd have to look for equivalent tools for every versioning system separately - if they exist at all. More likely, I'd be restricted just to the on-board tools of every system.
trinix912 · 6h ago
> But if all systems are so similar anyway, why would you need "the right tool for the job"?
They’re similar in the UI but the underlying architecture is vastly different, to accomplish different goals - sometimes what you want is an entirely centralized VCS, decentralized VCS, or a mix of both.
As for the tooling, any decent IDE supports different systems equally well. With IntelliJ I can use Git, SVN, and even CVS through the same UI. But yes, VSCode plugin XYZ doesn’t.
madeofpalk · 6h ago
In the same way some might be discouraged from contributing to a project because they don't know the language well enough, I've given up on contributing to projects because I couldn't figure out mercurial, and I didn't care enough about the contribution to learn it.
cyphar · 11h ago
To be clear, "fully distributed" also means "can use all of the features offline, and without needing commit access to the central repository".
I can't imagine living without that feature, but I also do a lot of OSS work so I'm probably biased.
flohofwoe · 11h ago
How often are you fully offline though? A centralized version system could be updated to work in 'offline mode' by queueing pushed changes locally until internet is available again (and in SVN that would be quite trivial because it keeps the last known state of the repository in a local shadow copy anyway).
jayd16 · 2h ago
All the time. If you use p4 and git you'll notice the difference just from trying to clone git repos locally vs trying to make a new workspace.
cyphar · 4h ago
It's not terribly common, but more critically it means I can work even when the source forge is down. (And for corporate stuff it means I can work on stuff without connecting to the VPN.)
Also, designing around distribution meant that merges have to be fast and work well -- this is a problem that most centralised systems struggle with because it's not a key part of the workflow. Branching and merging are indispensable tools in version control and I'm shocked how long CVS and SVN survived despite having completely broken systems for one or both. Being able to do both (and do stuff like blames over the entire history) without needing to communicate with a server is something that changes the way you work.
My actual hot take (as a kernel developer) is that email patches are good, actually. I really wish more people were aware of how incredibly powerful they are -- I have yet to see a source forge that can get close to the resiliency and flexibility of email (though SourceHut gets very close). git-send-email has its quirks, but b4 honestly makes it incredibly easy.
(There's also the owning your own data benefits that I think often go overlooked -- reflog and local clones mean that you cannot practically lose data even if the server goes away or you get blocked. I suspect Russian or Iranian developers losing access to their full repo history because of US sanctions wouldn't share your views on centralised services.)
themafia · 10h ago
> by queueing pushed changes locally
And if you and another developer make conflicting changes while offline? What should happen when you return online?
flohofwoe · 10h ago
The same thing as in git: you're not allowed to push your changes to the server until you have resolved the conflicts locally.
E.g. with current svn you get the latest changes from the server, open a diff editor, fix the conflicts and then commit.
The only difference here between svn and git is that svn merges the 'commit' and 'push' operations into one, e.g. instead of not being allowed to push, you're not allowed to commit in svn if there are pending conflicts.
This would be the part that would need to change if svn would get a proper 'offline mode', e.g. commits would need to go into some sort of 'local staging queue' until you get internet access back, and conflict resolutions would need to happen on the commits in that staging queue. But I really doubt if that's worth the hassle because how often are you actually without internet while coding?
linhns · 11h ago
> People should use the VCS that's appropriate for their project rather than insist on git everywhere.
But git is likely to be appropriate almost everywhere. You won't use svn just for big-file purposes while git is better for everything else in the same project.
flohofwoe · 9h ago
The thing is, in a game project, easily 99% of all version controlled data is in large binary files, and text files are by far the minority (at least by size). Yet still people try to use git for version control in game projects just because it is the popular option elsewhere and git is all they know.
IshKebab · 6h ago
> text files are by far the minority (at least by size)
Well yeah because text files are small. Thinking text files are insignificant to games because they are small is a really dumb perspective.
> Yet still people try to use git for version control in game projects just because it is the popular option elsewhere and git is all they know.
Or perhaps it's because it works really well for text files, which are a significant part of most games, and because the tooling is much better than for other VCS's.
flohofwoe · 5h ago
> Thinking text files are insignificant to games because they are small is a really dumb perspective.
Fact is that code is only one aspect of a game project, and arguably not the most important. Forcing a programmer-centric workflow on artists and designers is an even dumber perspective ;)
> and because the tooling is much better than for other VCS's
...only for text files. For assets like images, 3d models or audio data it's pretty much a wasteland.
IshKebab · 29m ago
Sure, which is why many studios don't use Git. I was just saying that your argument that code is unimportant because they are text files is super dumb.
jayd16 · 2h ago
Not because it's popular, but because all the tooling (CI/CD) around git is the best.
In games a lot of the tooling assumes P4 so it's often a better choice, on the whole, but if git and LFS were as widely supported in art tooling it would be the clear choice.
flohofwoe · 11h ago
> Git is intended to be fully distributed
Which is kinda funny because most people use git through Github or Gitlab, e.g. forcing git back into a centralized model ;)
> People should use the VCS that's appropriate for their project rather than insist on git everywhere.
Indeed, but I think that train has left long ago :)
I had to look it up after I wrote that paragraph about locking, but it looks like file locking is supported in Git (just weird that I need to press a button in the Gitlab UI to lock a file).
...why not simply 'git lock [file]' and 'git push --locks' like it works with tags?
jonathanlydall · 11h ago
If you’re making local commits (or local branch, or local merge, etc), you’re leveraging the distributed nature of Git. With SVN all these actions required being done on the server. This meant if you were offline and were thinking of making a risky change you might want to back out of, there was no way on SVN to make an offline commit which you could revert to if you wanted.
Of course if you’re working with others you will want a central Git server you all synchronize local changes with. GitHub is just one of many server options.
blacksqr · 1h ago
Shortly after git became popular, the Subversion team released a tool that enabled replicating a SVN repository or any subset to a local independent repository, which could be used for local development and re-synced to the main repository at any time.
flohofwoe · 11h ago
I'm pretty sure that if required, SVN could be updated to have an 'offline mode' where more offline operations would be allowed, since it keeps a local shadow copy of the last repository state from the server anyway. But offline mode simply isn't all that necessary anymore. How often is an average programmer in an area without any mobile coverage or other means of internet connectivity?
nomel · 11h ago
I very much dislike git (mostly because software hasn't been "just source" for many decades before git was cobbled together, and git's binary handling is a huge pain for me), but what does a lockfile achieve that a branch -> merge doesn't, practically, and even conceptually?
flohofwoe · 11h ago
Binary files can't be meaningfully merged. E.g. if two people create separate branches and work on the same Photoshop file, good luck trying to merge those changes, you can at best pick one version and the other user just wasted a lot of time. A lock communicates to other users "hey I'm currently working on that file, don't touch it" - this also means that lockable files must be write-only all the time unless explicitly locked - so that people don't "accidentially" start working on the file without acquiring a lock first. Apparently (as far as I have gathered so far from googling) git-lfs supports file locking but not vanilla git.
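For reference, the git-lfs locking workflow looks roughly like this (file names are examples):

```sh
git lfs track "*.psd" --lockable   # lockable files stay read-only until locked
git lfs lock art/hero.psd          # registers the lock with the LFS server
# ...edit, commit, push...
git lfs unlock art/hero.psd
```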
xg15 · 11h ago
> Binary files can't be meaningfully merged.
I think we really need more development of format-specific diff and merge tools. A lot of binary formats could absolutely be diffed or merged, but you'd need algorithms and maybe UIs specific to that format - there is no "generic" algorithm like for text-based files. (And even there, generic line-wise diffing is often more "good enough" than really good.)
I think it would be awesome if we could get "diff/merge servers" analogous to the "language servers" for IDEs some day.
ramses0 · 5h ago
Git actually supports this, believe it or not, although it's a bit wonky (of course):
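A minimal sketch of the mechanism, using the well-known exiftool textconv example; custom merge strategies are registered similarly via merge.<name>.driver:

```sh
# Tell git how to turn the binary into comparable text for diffs.
git config diff.exif.textconv exiftool

# .gitattributes assigns the driver to a file type:
#   *.jpg diff=exif
```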
The alternative of preventing complex merge situations in the first place through file locking is low-tech, easy to implement, and automatically works on all current and future file formats.
xg15 · 9h ago
Well, a concrete pain point where I had this problem was Unity scene files (a few years ago). Unity stored not just the assets but also the scene information in binary files, which made integrating that into git an absolute pain. (They made matters worse by also storing "last edit" timestamps in certain files, so git would show changes everywhere even if there weren't any. But that's another topic)
The problem was that the scene information was fundamentally visual (assets arranged in 3D space) so even a diffable text format wouldn't help you much. On the other hand, scenes are large enough that you often would want to work on them in parallel with other people.
I believe their first solution to that was the Asset Server that supported locking. But that still doesn't give two people the ability to work on a scene concurrently.
Eventually, some users went and developed a custom diff/merge tool to solve the problem.
An alternative approach (less powerful but simpler) might be to reversibly convert the binary files into a mergeable text-like form before committing them.
I've never done exactly that but I have occasionally decided how information will be represented in a data file with merging in mind.
flohofwoe · 8h ago
edit: write-only => read-only of course :)
jayd16 · 2h ago
You should look into how git is architected to support a lot of features SVN doesn't (distributed repos is a big one). When you clone a git repo you clone the full file history. This is trivial for text but can be extremely large for binary files.
Storage is not storage as you can store things as copies or diffs (and a million other ways). For code, diffs are efficient but for binaries, diffs approach double the size of the original files, simply sending/storing the full file is better.
These differences have big effects on how git operates and many design choices assumed diffable text assets only.
If you do a little research, there's plenty of information on the topic.
technoweenie · 19h ago
I'm really happy to see large file support in Git core. Any external solution would have similar opt-in procedures. I really wanted it to work seamlessly with as few extra commands as possible, so the API was constrained to the smudge and clean filters in the '.gitattributes' file.
Though I did work hard to remove any vendor lock-in by working directly with Atlassian and Microsoft pretty early in the process. It was a great working relationship, with a lot of help from Atlassian in particular on the file locking API. LFS shipped open source with compatible support in 3 separate git hosts.
wbillingsley · 16h ago
What I used to recommend to my sofware engineering classes is that instead of putting large files (media etc) into Git, put them into the artifact repository (Artifactory or something like it). That lets you for instance publish it as a snapshot dependency that the build system will automatically fetch for you, but control how much history of it you keep and only require your colleagues to fetch the latest version. Even better, a simple clean of their build system cache will free up the space used by old versions on their machines.
StopDisinfo910 · 9h ago
People like storing everything in git because it significantly simplifies configuration management. A build can be cleanly linked to a git hash instead of being a hash and a bunch of artifacts versions especially if you vendor your dependencies in your source control and completely stop using an artifact repository.
With a good build system using a shared cache, it makes for a very pleasant development environment.
astrobe_ · 13h ago
It sounds like a submodule... But certainly if the problem could be solved with a submodule, people would have found out long ago. Git's submodules also support shallow-cloning already [1]. I can only guess what the issues are with large files since I didn't face them myself - I deal with pure source code most of the time. I'm interested to know why it would be a bad idea to do that, just in case. The caveats pointed out in the second SO answer don't seem to be a big deal.
It sounds different to me - a regular git submodule would keep all history, unlike a file storage with occasional snapshotting.
firesteelrain · 16h ago
Do you teach CI/CD systems architecture in your classes? Because I am finding that is what the junior engineers that we have hired seem to be missing.
Tying it all in with GitLab, Artifactory, CodeSonar, Anchore etc
TZubiri · 14h ago
I think the OP refers to assets that truly belong in Git because they are source code but large, like 3d models.
Release artifacts like a .exe would NOT belong in Git because it is not source code.
firesteelrain · 6h ago
I get it, I understood. I was forking the conversation for a sec
wbillingsley · 11h ago
Yes
cyberax · 13h ago
This has its own issues. Now you need to provision additional credentials into your CI/CD and to your developers.
Commits become multi-step, as you need to first commit the artifacts to get their artifact IDs to put in the repo. You can automate that via git hooks, but then you're back at where you started: git-lfs.
glitchc · 21h ago
No. This is not a solution.
While git LFS is just a kludge for now, writing a filter argument during the clone operation is not the long-term solution either.
Git clone is the very first command most people will run when learning how to use git. Emphasized for effect: the very first command.
Will they remember to write the filter? Maybe, if the tutorial to the cool codebase they're trying to access mentions it. Maybe not. What happens if they don't? It may take a long time without any obvious indication. And if they do? The cloned repo might not be compilable/usable since the blobs are missing.
Say they do get it right. Will they understand it? Most likely not. We are exposing the inner workings of git on the very first command they learn. What's a blob? Why do I need to filter on it? Where are blobs stored? It's classic abstraction leakage.
This is a solved problem: Rsync does it. Just port the bloody implementation over. It does mean supporting alternative representations or moving away from blobs altogether, which git maintainers seem unwilling to do.
bogwog · 7m ago
Maybe a manual filter isn't the right solution, but this does seem to add a lot of missing pieces.
The first time you try to commit on a new install, git nags you to set your email address and name. I could see something similar happen the first time you clone a repo that hits the default global filter size, with instructions on how to disable it globally.
> The cloned repo might not be compilable/usable since the blobs are missing.
Maybe I misunderstood the article, but isn't the point of the filter to prevent downloading the full history of big files, and instead only check out the required version (like LFS does)?
So a filter of 1 byte will always give you a working tree, but trying to checkout a prior commit will require a full download of all files.
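For concreteness, the invocation under discussion is something like this (URL and threshold are examples):

```sh
# History blobs over 25 MB are skipped and fetched lazily when a checkout needs them.
git clone --filter=blob:limit=25m https://example.com/big-repo.git
# Or omit all blobs up front:
git clone --filter=blob:none https://example.com/big-repo.git
```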
IshKebab · 21h ago
I totally agree. This follows a long tradition of Git "fixing" things by adding a flag that 99% of users won't ever discover. They never fix the defaults.
And yes, you can fix defaults without breaking backwards compatibility.
Jenk · 20h ago
> They never fix the defaults
Not strictly true. They did change the default push behaviour from "matching" to "simple" in Git 2.0.
hinkley · 20h ago
So what was the second time the stopped watch was right?
I agree with GP. The git community is very fond of doing checkbox fixes for team problems that aren’t or can’t be set as defaults and so require constant user intervention to work. See also some of the sparse checkout systems and adding notes to commits after the fact. They only work if you turn every pull and push into a flurry of activity. Which means they will never work from your IDE. Those are non fixes that pollute the space for actual fixes.
smohare · 19h ago
I’ve used git since its inception. Never once in an “IDE”. Should users that refuse to learn the tool really be the target?
I’m not trying to argue that interface doesn’t matter. I use jq enough to be in that unfortunate category where I despise its interface. But it is difficult for me to imagine being similarly incapable in git.
TGower · 20h ago
> The cloned repo might not be compilable/usable since the blobs are missing.
Only the histories of the blobs are filtered out.
ks2048 · 20h ago
> This is a solved problem: Rsync does it.
Can you explain what the solution is? I don't mean the details of the rsync algorithm, but rather what it would look like from the users' perspective. What files are on your local filesystem when you do a "git clone"?
hinkley · 20h ago
When you do a shallow clone, no files would be present. However when doing a full clone you’ll get a full copy of each version of each blob, and what is being suggested is treat each revision as an rsync operation upon the last. And the more times you muck with a file, which can happen a lot both with assets and if you check in your deps to get exact snapshotting of code, that’s a lot of big file churn.
tatersolid · 15h ago
The overwhelming majority of large assets (images, audio, video) will receive near-zero benefit from using the rsync algorithm. The formats generally have massive byte-level differences even after small “tweaks” to a file.
TZubiri · 14h ago
Video might be strictly out of scope for git, consider that not even youtube allows 'updating' a video.
This will sound absolutely insane, but maybe the source code for the video should be a script? Then the process of building produces a video which is a release artifact?
qubidt · 10h ago
This is relatively niche, but that's a thing for anime fan-encodes. Some groups publish their vapoursynth scripts, allowing you to produce the same re-encoding (given you have the same source video). e.g.:
That is nowhere near practical for even basic use cases like a website or a mobile app.
cyberax · 13h ago
A lot of video editing includes splicing/deleting some footage, rather than full video rework. rsync, with its rolling hash approach, can work wonders for this use-case.
expenses3 · 8h ago
Exactly. If large files suck in git then that's because the git backend and cloning mechanism sucks for them. Fix that and then let us move on.
spyrja · 20h ago
Would it be incorrect to say that most of the bloat relates to historical revisions? If so, maybe an rsync-like behavior starting with the most current version of the files would be the best starting point. (Which is all most people will need anyhow.)
pizza234 · 20h ago
> Would it be incorrect to say that most of the bloat relates to historical revisions?
Based on my experience (YMMV), I think it is incorrect, yes, because any time I've performed a shallow clone of a repository, the saving wasn't as much as one would intuitively imagine (in other words: history is stored very efficiently).
spyrja · 19h ago
Doing a bit of digging seems to confirm that, considering that git actually does remove a lot of redundant files during the garbage collection phase. It does however store complete files (unlike a VCS like Mercurial, which stores deltas), so it still might benefit from a download-the-current-snapshot-first approach.
cesarb · 9h ago
> It does however store complete files (unlike a VCS like mercurial which stores deltas)
The logical model of git is that it stores complete files. The physical model of git is that these complete files are stored as deltas within pack files (except for new files which haven't been packed yet; by default git automatically packs once there are too many of these loose files, and they're always packed in its network protocol when sending or receiving).
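You can see both sides of that model locally:

```sh
git count-objects -v      # loose objects vs. objects already in packs
git gc                    # pack loose objects (git also does this automatically)
# Per-object sizes, compressed sizes, and delta-chain depth inside a pack:
git verify-pack -v .git/objects/pack/pack-*.idx | head
```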
olddustytrail · 2h ago
Yes, the problem really stems from the fact that git "understands" text files but not really anything other than that. It can't make a good diff between, say, a jpeg and its updated version, so it simply relies on compression for those other formats.
It would be nice to have a VCS that could manage these more effectively but most binary formats don't lend themselves to that, even when it might be an additional layer to an image.
I reckon there's still room for better image and video formats that would work better with VCS.
TZubiri · 14h ago
"Will they remember to write the filter? Maybe, "
Nothing wrong with "forgetting" to write the filter, and then if it's taking more than 10 minutes, write the filter.
Too · 14h ago
What? Why would you want to expose a beginner to waiting 10 minutes unnecessarily. How would they even know what they did wrong or what's a reasonable time to wait, ask chatgpt "why is my git clone taking 10 minutes"?!
Is this really the best we can do in terms of user experience? No. Git needs to step up.
TZubiri · 3h ago
Git is not for beginners in general. And large repos are less for beginners.
A beginner will follow instructions in a README: "Run git clone" or "run git clone --depth=1".
matheusmoreira · 20h ago
It is a solution. The fact beginners might not understand it doesn't really matter, solutions need not perish on that alone. Clone is a command people usually run once while setting up a repository. Maybe the case could be made that this behavior should be the default and that full clones should be opt-in but that's a separate issue.
gschoeni · 18h ago
We're working on `oxen` to solve a lot of the problems we ran into with git or git-lfs.
We have an open source CLI and server that mirrors git, but handles large files and mono repos with millions of files in a much more performant manner. Would love feedback if you want to check it out!
How about git just fixes shallow clones and partial clones? Then we don't need convoluted workarounds to cheat in large content after we fully clone a history of pointers or promises or whatever. You should be able to set default clone depth by file type and size in the git attributes (or maybe a file that can also live above a repo, like supporting attributes in .gitconfig locations?).
Then the usual settings would be to shallow clone the latest content as well as fetch the full history and maybe the text file historical content. Ideally you could prune to the clone depth settings as well.
Why are we still talking about large file pointers? If you fix shallow and partial clones, then any repo can be an efficient file mirror, right?
jameshart · 20h ago
Nit:
> if I git clone a repo with many revisions of a noisome 25 MB PNG file
FYI ‘noisome’ is not a synonym for ‘noisy’ - it’s more of a synonym for ‘noxious’; it means something smells bad.
williadc · 19h ago
I believe that was the author's intent.
jameshart · 8h ago
I guess maybe it’s the nonstandard sMEL chunk that bumps the size of the PNG file up so high. Seemed more to me that they were talking about an image of random noise though.
bahmboo · 20h ago
I'm just dipping my toe into Data Version Control - DVC. It is aimed towards data science and large digital asset management using configurable storage sources under a git meta layer. The goal is separation of concerns: git is used for versioning and the storage layers are dumb storage.
Does anyone have feedback about personally using DVC vs LFS?
fvncc · 17h ago
When I tried DVC ~5 years ago it was very slow as it constantly hashed files for some reason.
Dud author here. Happy to hear it's working well for you!
memmel · 17h ago
I'm in the same boat - I decided this week for DVC over LFS.
For me, the deciding factor was that with LFS, if you want to delete objects from storage, you have to rewrite git history. At least, that's what both the Github and Gitlab docs specify.
DVC adds a layer of indirection, so that its structure is not directly tied to git. If I change my mind and delete the objects from S3, dvc might stop working, but git will be fine.
Some extra pluses about DVC:
- It can point to versioned S3 objects that you might already have as part of existing data pipelines.
- It integrates with the Python fsspec library to read the files on demand using paths like "dvc://path/to/file.parquet". This feels nicer than needing to download all the files up front.
Evidlo · 18h ago
I did a simple test tracking a few hundred gigs of random /dev/urandom data. LFS choked on upload speed while DVC worked fine. My team is using DVC now
sthoward · 7h ago
We built `oxen` to solve the issues we (and many others) had with DVC and LFS. The highlights: open source cli and server, mirrors git for easy learning, handles large files and millions of files, performant af. Would love feedback if you check it out.
My main complaint about DVC is that it's hard to manage the files and if you keep modifying a big file you are going to end up with all the revisions stored in S3 (or whichever storage you choose). This is by design but I wish it was easier to set up like "store only the latest 3 revisions"
ideasman42 · 14h ago
Something they missed in the Git LFS cons is authentication: if you don't use ssh-agent, pushing involves authenticating multiple times, sometimes more than 2-3 times in my experience.
goneri · 21h ago
git-annex is a good alternative to GitHub's solution, and it supports different storage backends. I'm actually surprised it's not more popular.
alchemist1e9 · 14h ago
Was looking for a mention of git-annex and I completely agree. I’ve used it extensively and have found it works really well.
Any ideas why it isn’t more popular and more well known?
avar · 7h ago
I use it, and love it.
But it's not intended for or good at (without forcing a square peg into a round hole) the sort of thing LFS and promisors are for, which is a public project with binary assets.
git-annex is really for (and shines at) a private backup solution where you'd like to have N copies of some data around on various storage devices, track the history of each copy, ensure that you have at least N copies etc.
Each repository gets a UUID, and each tracked file has a SHA-256 hash. There's a branch which has a timestamp and repo UUID to SHA-256 mapping, if you have 10 repos that file will have (at least) 10 entries.
You can "trust" different repositories to different degrees, e.g. if you're storing a file on both some RAID'd storage server, or an old portable HD you're keeping in a desk drawer.
This really doesn't scale for a public project. E.g. I have a repository that I back up my photos and videos in, that repository has ~700 commits, and ~6000 commits to the metadata "git-annex" branch, pretty close to a 1:10 ratio.
There's an exhaustive history of every file movement that's ever occurred on the 10 storage devices I've ever used for that repository. Now imagine doing all that on a project used by more than one person.
All other solutions to tracking large files along with a git repository forgo all this complexity in favor of basically saying "just get the rest where you cloned me from, they'll have it!".
cesarb · 8h ago
> Any ideas why it isn’t more popular and more well known?
While git-annex works very well on Unix-style systems with Unix-style filesystems, it heavily depends on symbolic links, which do not exist on filesystems like exFAT, and are problematic on Windows (AFAIK, you have to be an administrator, or enable an obscure group policy). It has a degraded mode for these filesystems, but uses twice the disk space in that mode, and AFAIK loses some features.
pabs3 · 15h ago
The git storage model also needs an overhaul to be more like modern backup tools such as restic/borg; it needs content-defined chunking of files and directories.
tpoacher · 42m ago
That or rediscover the beauty of svn
(only half-trolling)
astrolx · 6h ago
For what it's worth, I have a self-hosted git forge where I upped the maximum file size limit. I use git repositories for my science projects; they can each have hundreds of large files (> 25 MB, as described in the article).
I don't use LFS, I don't encounter any issue.
tionis · 4h ago
I've been using git + git-lfs to manage, sync and backup all my files (including media files and more) and it's quite convenient, but native support for large files would be great. I'd for example really like to be able to push large objects directly from one device to the next.
At the moment I'm using my own git server + git lfs deduplication using btrfs to efficiently handle the large files.
If large objects are just embedded in various packfiles this approach would no longer work, so I hope that such a behaviour can be controlled.
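Roughly what that deduplication looks like in practice (hedged: `git lfs dedup` needs a reflink-capable filesystem such as btrfs or XFS, and the server-side path below is made up):

    git lfs dedup                        # replace identical LFS copies in .git/lfs and the worktree with reflinks
    duperemove -dr /srv/git/lfs-objects  # or dedupe the server's LFS object store filesystem-wide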
HexDecOctBin · 20h ago
So this filter argument will reduce the repo size when cloning, but how will one reduce the repo size after a long stint of local commits of changing binary assets? Delete the repo and clone again?
viraptor · 20h ago
It's really not clear which behaviour you want though. For example when you do lots of bisects you probably want to keep everything downloaded locally. If you're just working on new things, you may want to prune the old blobs. This information only exists in your head though.
HexDecOctBin · 19h ago
The ideal behaviour is to have a filter on push too, meaning that files above a certain size should be deleted from non-latest history after push.
viraptor · 18h ago
That would prevent old revisions from working... Why would that be ideal?
HexDecOctBin · 13h ago
Why would it stop old revisions from working? What would be the difference between cloning with the filter on and deleting local versions from old commits?
viraptor · 6h ago
I thought you wanted to prune the destination on push. Pruning local may work for some. It would be extremely annoying for me, because I'm way more likely to dig in old commits than to push. And I really don't want the existing blobs removed without explicit consent.
firesteelrain · 19h ago
Yes, once it gets bad enough your only option is to abandon the repo and move only the source code. Your old repo keeps the history from before the abandonment.
bobmcnamara · 5h ago
There is a trick for this, where you can set up a new repo to treat another repo as its pre-initial-commit source of history.
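A rough sketch of that trick with `git replace` (the repository path and revision names are placeholders):

    git remote add old ../old-repo
    git fetch old                                   # make the old history reachable locally
    git replace --graft <new-root-commit> old/main  # splice the old tip in as the new root's parent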
firesteelrain · 3h ago
Does that cause all the binaries in LFS to come over too?
bobmcnamara · 2h ago
I can't imagine it would, if you were moving both git and LFS.
The old repo will still be pointed to whatever the LFS config was at that time. If that service is still up, it should continue to work.
firesteelrain · 1h ago
Ah, my point is that even with LFS enabled, if you don't store the files external to git, the binaries are still totally part of the history (and really slow down cloning).
In my case, with a 25GB repo, it was really detrimental to performance
actinium226 · 19h ago
For lots of local edits you can squash commits using the rebase command with the interactive flag.
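For example (hedged; the reflog step is what actually lets the old blobs become prunable afterwards):

    git rebase -i HEAD~20                 # mark the intermediate asset commits as "squash"/"fixup"
    git reflog expire --expire=now --all  # drop reflog entries still pointing at the pre-rebase history
    git gc --prune=now                    # now gc can discard the unreferenced blobs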
reactordev · 20h ago
yeah, this isn't really solving the problem. It's just punting it. While I welcome a short-circuit filter, I see dragons ahead. Dependencies. Assets. Models... won't benefit at all as these repos need the large files - hence why there are large files.
rezonant · 19h ago
There seems to be a misunderstanding. The --filter option simply doesn't populate content in the .git directory which is not required for the checkout. If there is a large file that is needed for the current checkout (i.e. the parts not in the .git folder), it will be fetched regardless of the filter option.
To put it another way, regardless of what max size you give to --filter, you will end up with a complete git checkout, no missing files.
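For example (the URL is a placeholder):

    git clone --filter=blob:limit=1m https://example.com/big-repo.git
    cd big-repo
    git checkout <old-commit>   # any filtered-out blobs this commit needs are fetched on demand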
actuallyalys · 18h ago
It’s definitely not a full solution, but it seems like it would solve cases where having the full history of the large files available, just not on everyone’s machine, is the desired behavior.
als0 · 21h ago
10 years late is better than never.
nixpulvis · 20h ago
I was just using git LFS and was very concerned with how bad the help message was compared to the rest of git. I know it seems small, but it just never felt like a team player, and now I'm very happy to hear this.
tombert · 21h ago
Is Git ever going to get proper support for binary files?
I’ve never used it for anything serious but my understanding is that Mercurial handles binary files better? Like it supports binary diffs if I understand correctly.
Any reason Git couldn’t get that?
ks2048 · 20h ago
I'm not sure binary diffs are the problem - e.g. for storing images or MP3s, binary diffs are usually worse than nothing.
digikata · 20h ago
I would think that git would need a parallel storage scheme for binaries. Something that does binary chunking and deduplication between revisions, but keeps the same merkle referencing scheme as everything else.
tempay · 20h ago
> binary chunking and deduplication
Are there many binaries that people would store in git where this would actually help? I assume most files end up with compression or some other form of randomization between revisions making deduplication futile.
adastra22 · 19h ago
A lot in the game and visual art industries.
zigzag312 · 9h ago
2-3x reduction in repository size compared to Git LFS in this test: https://xethub.com/blog/benchmarking-the-modern-development-...
I don't know; it all comes down to the statistics of the dataset which optimization strategy comes out better. Git annex IIRC does file-level dedupe. That would take care of most of the problem if you're storing binaries that are compressed or encrypted. It's a lot of work to go beyond that, and probably one reason no one has bothered with git yet. But borg and restic both do chunked dedupe, I think.
hinkley · 20h ago
It would likely require more tooling.
zigzag312 · 10h ago
Xet uses block level deduplication.
zigzag312 · 7h ago
> for storing images or MP3s, binary diffs are usually worse than nothing
Editing the ID3 tag of an MP3 file or changing the rating metadata of an image will give a big advantage to block level deduplication. Only a few such cases are needed to more than compensate for that worse than nothing inefficiencies of binary diffs when there's nothing to deduplicate.
firesteelrain · 19h ago
A lot of people use Perforce Helix and others use Plastic SCM. That's been my experience for large binary assets with git-like functionality.
tom_ · 18h ago
I didn't enjoy using Plastic, but Perforce is OK (not to say that it's perfect; I miss a lot of git stuff). It does have no problems with lots of data, though! This article moans about the overhead of a 25 MB PNG file... it's been a long time since I worked on a serious project where the head revision is less than 25 GB. Typical daily churn would be 2.5 GB+.
(It's been even longer since I used svn in anger, but maybe it could work too. It has file locking, and local storage cost is proportional to the size of the head revision. It was manageable enough with a 2 GB head revision. Metadata access speed was always terrible though, which was tedious.)
firesteelrain · 17h ago
SVN should be able to handle large files with no issue, IMHO.
afiori · 11h ago
My understanding is that git's diff algorithms require a file to be segmentable (e.g. text files are split line-wise) and there is no general segmentation strategy for binary blobs.
But a good segmentation only buys you better compression and nicer diffs; git could do byte-wise diffs with no issues. So I wonder why git doesn't use customizable segmentation strategies where it calls external tools based on file type (e.g. a Rust thingy for Rust files, or a PNG thingy for PNG files).
At worst the tool would return either a single segment for the entire file or the byte-wise split, which would work anyway.
joeyh · 6h ago
A common misconception. git has always used binary deltas for pack files. Consider that git tree objects are themselves not text files, and git needs to efficiently store slightly modified versions of the same tree.
brucehoult · 17h ago
All files in git are binary files.
All deltas between versions are binary diffs.
Git has always handled large (including large binary) files just fine.
What it doesn't like is files where a conceptually minor change changes the entire file, for example compressed or encrypted files.
The only somewhat valid complaint is that if someone once committed a large file and then it was later deleted (maybe minutes later, maybe years later) then it is in the repo and in everyone's checkouts forever. Which applies equally to small and to large files, but large ones have more impact.
That's the whole point of a version control system. To preserve the history, allowing earlier versions to be recreated.
The better solution would be to have better review of changes pushed to the master repo, including having unreviewed changes in separate, potentially sacrificial, repos until approved.
kerneltime · 18h ago
https://github.com/oneconcern/datamon
I wrote this git-for-data tool a few years back (it works with GCS but can be made to work with S3).
1. No server side
2. Immutable data (via GCS policies)
3. Ability to mount data sets as filesystems
4. Integrated with k8s.
It was built to work for the needs of the startup funding it, but I would love it if it could be extended.
rswail · 9h ago
Is the problem that we don't have good "diff" equivalents for binaries that git could use to only store those diffs like the old RCS/CVS for large files?
procaryote · 9h ago
subversion used to do that, actually probably still does... and also only checks out the latest revision. Svn is a bother in other ways of course, like being worse at regular version control, and only usable with access to the server etc.
There's a bunch of binary files that change a lot on small changes due to compression or how the data is serialised, so the problem doesn't go away completely. One could conceivably start handling that, but there are lots of file formats out there and the sum of that complexity tends to be bugs and security issues.
rswail · 8h ago
Potentially with a new blob type, but maintaining a reverse diff would be difficult, as it would change the hash of the previous version if you had to store the diff.
Another alternative would be storing the chunks as blobs so that you reconstruct the full binary and only have to store the changed chunks. However that doesn't work with compressed binaries.
IshKebab · 6h ago
Not really. Git does use delta-based storage for binary files. It might not be as good as it could be for some files (e.g. compressed ones) but that's relatively easy to solve.
The real problem is that Git wants you to have a full copy of all files that have ever existed in the repo. As soon as you add a large file to a repo it's there forever and can basically never be removed. If you keep editing it you'll build up lots more permanent data in the repo.
Git is really missing:
1. A way to delete old data.
2. A way for the repo to indicate which data is probably not needed (old large binaries).
3. A way to serve large files efficiently (from a CDN).
Some of these can sort of be done, but it's super janky. You have to proactively add confusing flags etc.
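For reference, the "janky" options alluded to above look roughly like this (hedged; `git filter-repo` is a separate tool, not part of core git, and it rewrites history):

    git clone --filter=blob:limit=10m <url>        # don't download historical blobs over 10 MB
    git filter-repo --strip-blobs-bigger-than 10M  # permanently delete them from history instead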
mathi0750 · 18h ago
Have you tried Oxen.ai? They are doing more fine-tuning and inference now, but they have an open-source data version control platform written in Rust at the core of their product.
captn3m0 · 17h ago
Partial clones are also dependent on the server side supporting this. GitHub is one of the very few that does. git.kernel.org, for example, did not, last I checked.
moonlion_eth · 16h ago
the future of git is jj
expenses3 · 8h ago
Does it handle large files better? I thought it was just an improved interface over the git store.
a_t48 · 18h ago
The real GH LFS cost is not the storage but the bandwidth on pulling objects down for every fresh clone. $$$$$. See my other comment. :)
Imustaskforhelp · 6h ago
Xet on Hugging Face doesn't seem to have such bandwidth issues, IMO. I wish something like Xet, but open source, existed.
CSMastermind · 16h ago
I could really use an alternative to Plastic SCM for 3D models. S3 is fine but lacks the niceties.
jiggawatts · 20h ago
What I would love to see in an SCM that properly supports large binary blobs is storing the contents using Prolly trees instead of a simple SHA hash.
Prolly trees are very similar to Merkle trees or the rsync algorithm, but they support mutation and version history retention with some nice properties. For example: you always obtain exactly the same tree (with the same root hash) irrespective of the order of incremental edit operations used to get to the same state.
In other words, two users could edit a subset of a 1 TB file, both could merge their edits, and both will then agree on the root hash without having to re-hash or even download the entire file!
Another major advantage on modern many-core CPUs is that Prolly trees can be constructed in parallel instead of having to be streamed sequentially on one thread.
Then the really big brained move is to store the entire SCM repo as a single Prolly tree for efficient incremental downloads, merges, or whatever. I.e.: a repo fork could share storage with the original not just up to the point-in-time of the fork, but all future changes too.
hinkley · 20h ago
Git has had a good run. Maybe it’s time for a new system built by someone who learned about DX early in their career, instead of via their own bug database.
If there’s a new algorithm out there that warrants a look…
viraptor · 19h ago
Jujutsu unfortunately doesn't have any story for large files yet (as far as I can tell), but maybe soon
...
We have also talked about doing something similar for tree objects in order to better support very large directories (to reduce the amount of data we need to transfer for them) and very deep directories (to reduce the number of roundtrips to the server). I think we have only talked about that on Discord so far (https://discord.com/channels/968932220549103686/969291218347...). It would not be compatible with Git repos, so it would only really be useful to teams outside Google once there's an external jj-native forge that decides to support it (if our rough design is even realistic).
dilap · 4h ago
Lack of git-lfs support (hack tho it is) is the only thing currently keeping me from trying (and, assuming it's as nice as it seems, using!) jj. Built-in, better large-file support would be amazing, and I'd happily just ditch git completely to gain that. I think there's probably a lot of people in the same boat.
Imustaskforhelp · 6h ago
This seems all the more reason to spin up an alternative to github but a jj-native forge.
Maybe we can fork something like codeberg's ui but use jj or maybe the jujutsu team can work with codeberg itself? I am pretty sure that codeberg team is really nice and this could be an experimental feature, which if it really needs to can be crowdfunded by community.
I will chip in the first dollar.
Dylan16807 · 12h ago
Can you list some realistic workflows where people would be touching the same huge file but only changing much smaller parts of it?
And yes you can represent a whole repo as a giant tar file, but because the boundaries between hash segments won't line up with your file boundaries you get an efficiency hit with very little benefit. Unless you make it file-aware in which case it ends up even closer to what git already does.
Git knows how to store deltas between files. Making that mechanism more reliable is probably able to achieve more with less.
bobmcnamara · 4h ago
Most Microsoft Office documents.
One of our projects has a UI editor with a 60MB file for nearly everything except images, and people work on different UI flows at the same time.
jiggawatts · 12h ago
Binary database files containing “master data”.
Merging would require support from the DB engine, however.
anon-3988 · 18h ago
What prevents Git from simply working better with large files?
AceJohnny2 · 18h ago
git works just fine with large files. The problem is that when you clone a repo, or pull, by default it gets everything, including large files deep in the history that you probably don't care about anymore.
That was actually an initial selling point of git: you have the full history locally. You can work from the plane/train/deserted island just fine.
These large files will persist in the repo forever. So people look for options to segregate large files out so that they only get downloaded on demand (aka "lazily").
All the existing options (submodules, LFS, partial clones) are different answers to "how do we make certain files only download on demand"
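Roughly how each of those "on demand" answers is invoked today (simplified; paths and sizes are illustrative):

    git clone --filter=blob:limit=5m <url>   # partial clone: history blobs fetched lazily
    git lfs pull --include "assets/**"       # LFS: only materialize selected pointer files
    git submodule update --init assets       # submodule: opt in to the big sub-repository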
expenses3 · 8h ago
No, git does not work 'just fine' with large files. It works like ass.
nomel · 15h ago
> All the existing options
Don't forget sparse checkouts!
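For example (the paths are illustrative):

    git clone --filter=blob:none --sparse <url>
    git sparse-checkout set src docs   # only these paths get materialized (and fetched) in the worktree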
anon-3988 · 17h ago
IIRC, it takes ages for it to index a large folder. I was trying to use it to store the diffs of my backup folder that constantly gets rclone'd and rsync'd over, in case those fucked up catastrophically.
Affric · 21h ago
Incredible.
Nice to see some Microsoft and Google emails contributing.
As it should be! If it's not native to git, it's not worth using. I'm glad these issues are finally being solved.
These new features are pretty awesome too. Especially separate large object remotes. They will probably enable git to be used for even more things than it's already being used for. They will enable new ways to work with git.
sublinear · 20h ago
May I humbly suggest that those files probably belong in an LFS submodule called "assets" or "vendor"?
Then you can clone without checking out all the unnecessary large files to get a working build. This also helps on the legal side to correctly license your repos.
I'm struggling to see how this is a problem with git and not just antipatterns that arise from badly organized projects.
nomel · 15h ago
The problem I've run into with this is that those files stay in the history. Your git clones will get ridiculous, and you'll blast through any git repo size limits that you might have.
I just want my files to match what's expected when I pull a commit, without needing some literal "commit build system" and "pull build system". Coming from Perforce and SVN, I can't comprehend why git is so popular, beyond cargo cult. It's completely nonsensical to think that software is just source.
charcircuit · 20h ago
The user shouldn't have to think about such a thing. Version control should handle everything automatically and not force the user into doing extra work to workaround issues.
hinkley · 20h ago
I always hated the “write your code like the next maintainer is a psychopath” mantra because it makes the goal unclear. I prefer the following:
Write your code/tools as if they will be used at 2:00 am while the server room is on fire. Because sooner or later they will be.
A lot of our processes are used like emergency procedures. Emergency procedures are meant to be brainless as much as possible. So you can reserve the rest of your capacity for the actual problem. My version essentially calls out Kernighan’s Law.
sublinear · 19h ago
Organizing your files sensibly is not necessary to use LFS nor is it a "workaround". It's just a pattern I am suggesting to make life easier regardless of what tools you decide to use. I can't think of a case where organizing your project to fail gracefully is a bad idea.
Git does the responsible thing and lets the user determine how to proceed with the mess they've made.
I must say I'm increasingly suspicious of the hate that git receives these days.
forrestthewoods · 20h ago
Git is fundamentally broken and bad. Almost all projects are defacto centralized. Your project is not Linux.
A good version control system would support petabyte scale history and terabyte scale clones via sparse virtual filesystem.
Git’s design is just bad for almost all projects that aren’t Linux.
(I know this will get downvoted. But most modern programmers have never used anything but Git and so they don’t realize their tool is actually quite bad! It’s a shame.)
codethief · 20h ago
> A good version control system would support petabyte scale history and terabyte scale clones via sparse virtual filesystem.
I like this idea in principle but I always wonder what that would look in practice, outside a FAANG company: How do you ensure the virtual file system works equally well on all platforms, without root access, possibly even inside containers? How do you ensure it's fast? What do you do in case of network errors?
Someone just needs to do it. Numerous companies have built their own cross-platform VFS layers. It's hard but not intractable.
Re network errors. How many things break when GitHub is down? Quite a lot! This isn’t particularly special. Prefetch and clone are the same operation.
ants_everywhere · 20h ago
Yeah we're at the CVS stage where everyone uses it because everyone uses it.
But most people don't need most of its features and many people need features it doesn't have.
If you look up git worktrees, you'll find a lot of blog articles referring to worktrees as a "secret weapon" or similar. So git's secret weapon is a mode that lets you work around the ugliness of branches. This suggests that many people would be better suited by an SCM that isn't branch-based.
It's nice having the full history offline. But the scaling problems force people to adopt a workflow where they have a large number of small git repos instead of keeping the history of related things together. I think there are better designs out there for the typical open source project.
DonHopkins · 20h ago
Git now has artificial feet to aim the foot guns at so they hit the right target.
matheusmoreira · 17h ago
I don't understand what you mean by "the ugliness of branches".
In my experience, branches are totally awesome. Worktrees make branches even more awesome because they let me check out multiple branches at once to separate directories.
The only way it could get better is if it somehow gains the ability to check out the same branch to multiple different directories at once.
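For example (paths and branch names are made up; `--detach` is the usual workaround for wanting the same branch in two places):

    git worktree add ../hotfix-1.2 release/1.2   # second working copy, different branch
    git worktree add --detach ../scratch HEAD    # same commit again, detached, in a third directory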
matheusmoreira · 20h ago
Completely disagree. Git is fundamentally functional and good. All projects are local and decentralized, and any "centralization" is in fact just git hosting services, of which there are many options which are not even mutually exclusive.
compiler-guy · 18h ago
Git works fine and is solid and well enough known to be a reasonable choice for most people.
But I encourage everyone to try out a few alternatives (and adopt their workflows at least for a while). I have no idea if you have or not.
But if one has never used the alternatives, one doesn't really know just how nice things can be. Or, even if you still find git to be your preferred choice, having an alternative experience can open you to other possibilities and ways of working.
Just like everyone should try a couple of different programming languages or editors or anything else for size. You may not end up choosing it, but seeing the possibilities and different ways of thinking is a very good thing.
ItsHarper · 13h ago
Yeah, the decentralized design is incredibly useful in the day-to-day, for ~any project size.
forrestthewoods · 12h ago
Incorrect. All the features you think are associated with the D in DVCS are perfectly accessible to a more centralized tool.
the_arun · 19h ago
Are you missing that the central hosting services provide a good backup plan for your locally hosted git repos?
matheusmoreira · 19h ago
I agree! They are excellent git backup services. I use several of them: github, codeberg, gitlab, sourcehut. I can easily set up remotes to push to all of them at once. I also have copies of my important repositories on all my personal computers, including my phone.
This is only possible because git is decentralized. Claiming that git is centralized is complete falsehood.
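The multi-remote setup described above can be done with push URLs, roughly like this (the remote URLs are examples):

    git remote set-url --add --push origin git@github.com:me/repo.git
    git remote set-url --add --push origin git@codeberg.org:me/repo.git
    git push origin main   # one push updates both mirrors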
jokoon · 13h ago
Git lfs seems controversial in the comments here
Developers have their own drama
whatever1 · 18h ago
It is insane that after almost a century of running computations over data on computers, we still don't have a good version control system that maps a code version to its relevant data version.
Still, the approach is to put code and data in a folder and call it a day. Slap a "_FINAL" onto the folder name and you are golden.
firesteelrain · 19h ago
We had a repo that was at one point 25GB. It had Git LFS turned on but the files weren’t stored outside of BitBucket. Whenever a build was run in Bamboo, it would choke big time.
We found that we could move the large files to Artifactory as it has Git LFS support.
But the problem was the entire history that did not have Artifactory pointers. Every clone included the large files (for some reason the filter functionality wouldn't work for us; it was a large repo and it had hundreds of users, amongst other problems).
Anyways what we ended up doing was closing that repo and opening a new one with the large files stripped.
Nitpick on the author's page:
“ Nowadays, there’s a free tier, but you’re dependent on the whims of GitHub to set pricing. Today, a 50GB repo on GitHub will cost $40/year for storage”
This is not true as you don’t need GitHub to get LFS support
All our builds are on GHA definitions; there's no way it's worth it to swap us over to another build system, administer it, etc. Our team is small (two at the time, but hopefully doubling soon!), and there's barely a dozen people in the whole engineering org. The next hit-list item is to move from GH-hosted builders to GCE workers to get a warmer Docker cache (a bunch of our build time is spent pulling images that haven't changed) - it will also save a chunk of change (GCE workers are 4x cheaper per minute and the caching will make for faster builds), but the opportunity cost for me tackling that is quite high.
And even if that did work, I've found it much more reliable to use the actual Docker BuildX disk state than to try to get caching for complex multi-stage builds working reliably. I have a case right now where there's no combination of --cache-to/--cache-from flags that will give me a 100% cached rebuild starting from a fresh builder, using only remote cache. I should probably report it to the Docker team, but I don't have a minimal repro right now and there's a 10% chance it's actually my fault.
Apparently, this is coming in Q3 according to their public roadmap: https://github.com/github/roadmap/issues/1029
I run a small git LFS server because of this and will be happy to switch away the second I can get git to natively support S3.
There are a couple other projects that bridge S3 and LFS, though I had the most success with this setup.
If you fancy it for your datacenter, big players (Fujitsu, Lenovo, Huawei, HPE) will happily sell you "object storage" systems which also support S3 at very high speeds.
Scality's open source S3 Server also can run in a container.
LFS does break disconnected/offline/sneakernet operations which wasn't mentioned and is not awesome, but those are niche workflows. It sounds like that would also be broken with promisors.
The `git partial clone` examples are cool!
The description of Large Object Promisors makes it sound like they take the client-side complexity in LFS, move it server-side, and then increase the complexity? Instead of the client uploading to a git server and to an LFS server, it uploads to a git server which in turn uploads to an object store, but the client will download directly from the object store. Obviously different tradeoffs there. I'm curious how often people will get bit by uploading to public git servers which upload to hidden promisor remotes.
I dunno if their solution is any better but it's fairly unarguable that LFS is bad.
1. It is a separate tool that has to be installed separately from git
2. It works by using git filters and git hooks, which need to be set up locally.
Something built in to git doesn't have those problems.
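For context, the local setup being referred to is roughly:

    git lfs install        # installs the smudge/clean filter config and repo hooks
    git lfs track "*.psd"  # writes: *.psd filter=lfs diff=lfs merge=lfs -text  into .gitattributes
    git add .gitattributes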
But GP's point was that there is an entire other category of errors with git-lfs that is eliminated with this more native approach. Git-lfs allows you to get into an inconsistent state (e.g. when you interrupt a git action) that just doesn't happen with native git.
The architecture does seem to still be in the general framing of "treat large files as special and host them differently." That is the crux of the problem in the first place.
I think it would shock no one to find that the official system also needs to be enabled and also falls back to a mode where it supports fetching and merging pointers without full file content.
I do hope all the UX problems will be fixed. I just don't see them going away naturally and we have to put our trust in the hope that the git maintainers will make enjoyable, seamless and safe commands.
Mostly I have not run into such a use case, but in general I don't see any upside in trying to shove big files together with code within repositories.
That is why I don't understand why people "need to use GIT".
You can still do something else, like keeping versions and keeping track of those versions in many different ways.
You can store a reference in the repo, like a link or whatever.
Wanting to split up the project into multiple storage spaces is inherently hostile to managing the project. People want it together because it's important that it stays together as a basic function of managing a project of digital files. The need to track and maintain digital version numbers and linking them to release numbers and build plans is just a requirement.
That's what actual, real projects demand. Any projects that involve digital assets is going to involve binary, often large, data files. Any projects that involve large tables of pre-determined or historic data will involve large files that may be text or binary which contain data the project requires. They won't have everything encompassed by the project as a text file. It's weird when that's true for a project. It's a unique situation to the Linux kernel because it, somewhat uniquely, doesn't have graphics or large, predetermined data blocks. Well, not all projects that need to be managed by git share 100% of the attributes of the Linux kernel.
This idea that everything in a git project must be all small text file is incredibly bizarre. Are you making a video game? A website? A web application? A data driven API? Does it have geographic data? Does it required images? Video? Music or sound? Are you providing static documentation that must be included?
So the choices are:
1. Git is a useful general-purpose VCS for real-world projects.
2. Git does not permit binary or large files.
Tracking versioning on large files is not some massively complex problem. Not needing to care about diffing and merging simplifies how those files are managed.
Yes, because Git currently is not good at tracking large files. That's not some fundamental property of Git; it can be improved.
Btw it isn't GIT.
In other words, if you migrate a repo that has commits A->B->C, and C adds the large files, then commits A & B will gain a `.gitattributes` referring to the large files that do not exist in A & B.
This is because the migration function will carry its .gitattributes structure backwards as it walks the history, for caching purposes, and not cross-reference it against the current commit.
https://github.com/git-lfs/git-lfs/blob/main/docs/man/git-lf...
Now, granted, usually people run migrate only to convert new local commits, so by nature of the ref include/exclude system it will not touch older commits. But in my case I was converting an entire repo into one using LFS. I hoped it would preserve those commits in a base branch that didn't contain large files, but to my disappointment there was the aforementioned .gitattributes pollution.
> In all modes, by default git lfs migrate operates only on the currently checked-out branch, and only on files (of any size and type) added in commits which do not exist on any remote. Multiple options are available to override these defaults.
Were your remotes not configured correctly?
> But in my case I was converting an entire repo into one using LFS.
then check out the section in the manual "INCLUDE AND EXCLUDE REFERENCES"
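For example (hedged; the patterns are illustrative, and the manual linked above has the exact ref semantics):

    git lfs migrate import --everything --include="*.bin"                                 # rewrite all refs
    git lfs migrate import --include-ref=main --exclude-ref=origin/main --include="*.bin" # only commits not yet on origin/main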
Yea, I had the same thought. And TBD on large object promisors.
Git annex is somewhat more decentralized as it can track the presence of large files across different remotes. And it can pull large files from filesystem repos such as USB drives. The downside is that it's much more complicated and difficult to use. Some code forges used to support it, but support has since been dropped.
I think it is a much bigger barrier than SSH, and I have seen it be one on short-timeline projects where it's getting set up for the first time: they just end up paying GitHub crazy per-GB costs, or building rat's nests of tunnel and VPN configurations for different repos to keep remote access encrypted, with a whole lot more trouble than just an SSH path.
Commit IDs are based on a number of factors about the commit, including the actual contents and the commit ID of the parent commit. Any fully cloned git repository can theoretically be audited to make sure that all its commit IDs are correct. Nobody does this (although perhaps git does automatically?), but it's possible.
But now, picture a git repository that has a one petabyte file in one of its early commits (and deleted again later). Pretty much nobody is going to have the space required to download this, so many people will not even bother to do so. As such, what's to stop the server from just claiming any commit ID it wanted for this particular commit? Who's going to check?
(Bonus: For that matter, is the one petabyte file even real? Or just a faked size in the metadata?)
To be clear, I assume people have already thought about these issues. I'm just curious what the answers are.
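For whatever objects you do have locally, re-verification is routine; the commands below say nothing about blobs that were never downloaded, which is exactly the gap being asked about:

    git fsck --full                       # re-hashes every local object and checks connectivity
    git config transfer.fsckObjects true  # verify objects as they arrive on fetch/clone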
> High vendor lock-in – When GitHub wrote Git LFS, the other large file systems—Git Fat, Git Annex, and Git Media—were agnostic about the server-side. But GitHub locked users to their proprietary server implementation and charged folks to use it.
Is this a current issue?
I used Git LFS with a GitLab instance this week, seemed to work fine.
https://docs.gitlab.com/topics/git/lfs/
I also used Git LFS with my Gitea instance a week before that, it was fine too.
https://docs.gitea.com/administration/git-lfs-setup
At the same time it feels odd to hear mentions of LFS being deprecated in the future, while I’ve seldom seen anyone even use it - people just don’t seem to care and shove images and such into regular Git which puzzles me.
Nowhere is this behavior explicitly stated.
I used to use Git LFS on GitHub to do my company’s study on GitHub statistics because we stored large compressed databases on users and repositories.
Is that true? I used git commercially in five companies, and I never used github commercially (except as a platform for projects we opensourced).
You already depend on GitHub if you host your project there. But you're not locked in, because you can just close your GitHub repo and migrate somewhere else. Am I missing something?
If you used LFS, you have to fork and rewrite your repository to update the .lfsconfig backend URLs to get back to a reasonable working state.
What does SVN do differently than git when it comes to large binary files, and why can't git use the same approach?
I also don't quite understand tbh how offloading large files somewhere else would be fundamentally different from storing all files in one place, except for complicating everything. Storage is storage; how would a different storage location fix any of the current performance and robustness problems? Offloading just sounds like a solution for public git forges which don't want to deal with big files because it's too costly for them, but increased hosting cost is not the 'large binary file problem' of git.
(edit: apparently git supports proper locking(?) so I removed that section - ps: nvm it looks like the file locking feature is only in git-lfs)
People should use the VCS that's appropriate for their project rather than insist on git everywhere.
A lot of people don't seem to realise this. I work in game dev and SVN or Perforce are far far better than Git for source control in this space.
In AA game dev a checkout (not the complete history, not the source art files) can easily get to 300GB of binary data. This is really pushing Subversion to its limits.
In AAA gamedev you are looking at a full checkout of the latest assets (not the complete history, not the source art files) of at least 1TB and 2TB is becoming more and more common. The whole repo can easily come in at 100 TB. At this scale Perforce is really the only game in town (and they know this and charge through the nose for it).
In the movie industry you can multiply AAA gamedev by ~10.
Git has no hope of working at this scale as much as I'd like it to.
GitHub/GitLab is miles ahead of anything you can get with Perforce. People are not pushing for git just because of its UX; they're pushing git so they can use the ecosystem.
The above should work. But does git support multiple filters for a file? For example, first the asset-split filter above, and then storing the files in LFS, which is another filter.
I hope this "new" system works but I think Perforce is safe for now.
Disagree. I really like the "de-facto standard" that git has become for open source. It means if I want to understand some new project's source code, there is one less hassle for me to deal with: I don't need to learn any new concepts just to access the source code and all the tooling is already right there.
The situation we have with package managers, dependency managers and package managers for package managers is worse enough. I really don't want a world in which every language or every project also comes with its own version control system and remote repo infrastructure.
It's only git which has this fractal feature set which requires expert knowledge to untangle.
If nothing else, you have to install it. There will also be subtle differences between concepts, e.g. git and svn both have versions and branches, but the concepts behave differently. I don't know about Mercurial, but I'm sure they have their own quirks as well.
Also, tooling: I have a VSCode plugin that visualizes the entire graph structure of a git repo really nicely. Right now, I can use that on 99% of all repos to get an overview of the branches, last commits, activity, etc.
If version systems were fragmented, I'd have to look for equivalent tools for every versioning system separately - if they exist at all. More likely, I'd be restricted just to the on-board tools of every system.
They’re similar in the UI but the underlying architecture is vastly different, to accomplish different goals - sometimes what you want is an entirely centralized VCS, decentralized VCS, or a mix of both.
As for the tooling, any decent IDE supports different systems equally well. With IntelliJ I can use Git, SVN, and even CVS through the same UI. But yes, VSCode plugin XYZ doesn’t.
I can't imagine living without that feature, but I also do a lot of OSS work so I'm probably biased.
Also, designing around distribution meant that merges have to be fast and work well -- this is a problem that most centralised systems struggle with because it's not a key part of the workflow. Branching and merging are indispensable tools in version control and I'm shocked how long CVS and SVN survived despite having completely broken systems for one or both. Being able to do both (and do stuff like blames over the entire history) without needing to communicate with a server is something that changes the way you work.
My actual hot take (as a kernel developer) is that email patches are good, actually. I really wish more people were aware of how incredibly powerful they are -- I have yet to see a source forge that can get close to the resiliency and flexibility of email (though SourceHut gets very close). git-send-email has its quirks, but b4 honestly makes it incredibly easy.
(There's also the owning your own data benefits that I think often go overlooked -- reflog and local clones mean that you cannot practically lose data even if the server goes away or you get blocked. I suspect Russian or Iranian developers losing access to their full repo history because of US sanctions wouldn't share your views on centralised services.)
And if you and another developer make conflicting changes while offline? What should happen when you return online?
E.g. with current svn you get the latest changes from the server, open a diff editor, fix the conflicts and then commit.
The only difference here between svn and git is that svn merges the 'commit' and 'push' operations into one, e.g. instead of not being allowed to push, you're not allowed to commit in svn if there are pending conflicts.
This would be the part that would need to change if svn would get a proper 'offline mode', e.g. commits would need to go into some sort of 'local staging queue' until you get internet access back, and conflict resolutions would need to happen on the commits in that staging queue. But I really doubt if that's worth the hassle because how often are you actually without internet while coding?
But git is likely to be appropriate almost everywhere. You won't use svn just for big-file purposes while git is better for everything else in the same project.
Well yeah because text files are small. Thinking text files are insignificant to games because they are small is a really dumb perspective.
> Yet still people try to use git for version control in game projects just because it is the popular option elsewhere and git is all they know.
Or perhaps it's because it works really well for text files, which are a significant part of most games, and because the tooling is much better than for other VCS's.
Fact is that code is only one aspect of a game project, and arguably not the most important. Forcing a programmer-centric workflow on artists and designers is an even dumber perspective ;)
> and because the tooling is much better than for other VCS's
...only for text files. For assets like images, 3d models or audio data it's pretty much a wasteland.
In games a lot of the tooling assumes P4 so it's often a better choice, on the whole, but if git and LFS was as widely supported in art tooling it would be the clear choice.
Which is kinda funny because most people use git through Github or Gitlab, e.g. forcing git back into a centralized model ;)
> People should use the VCS that's appropriate for their project rather than insist on git everywhere.
Indeed, but I think that train has left long ago :)
I had to look it up after I wrote that paragraph about locking, but it looks like file locking is supported in Git (just weird that I need to press a button in the GitLab UI to lock a file):
https://docs.gitlab.com/user/project/file_lock/
...and here it says it's only supported with git lfs (so still weird af):
https://docs.gitlab.com/topics/git/file_management/#file-loc...
...why not simply 'git lock [file]' and 'git push --locks' like it works with tags?
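What exists today is spelled slightly differently and lives in git-lfs rather than core git (the path is a made-up example):

    git lfs lock assets/level1.umap    # take the lock (server-side, needs an LFS server with the locking API)
    git lfs locks                      # list current locks
    git lfs unlock assets/level1.umap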
Of course if you’re working with others you will want a central Git server you all synchronize local changes with. GitHub is just one of many server options.
I think we really need more development of format-specific diff and merge tools. A lot of binary formats could absolutely be diffed or merged, but you'd need algorithms and maybe UIs specific to that format - there is no "generic" algorithm like for text-based files. (And even there, generic line-wise diffing if often more "good enough" than really good)
I think it would be awesome if we could get "diff/merge servers" analogous to the "language servers" for IDEs some day.
https://github.com/ewanmellor/git-diff-image/blob/master/REA...
https://zachholman.com/posts/command-line-image-diffs/
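A small existing hook in this direction is git's textconv diff drivers, e.g. diffing image metadata through exiftool (assumes exiftool is installed; display-only, it doesn't change what's stored):

    git config diff.exif.textconv exiftool
    echo '*.png diff=exif' >> .gitattributes
    git diff    # PNG changes now show metadata diffs instead of "Binary files differ"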
The alternative of preventing complex merge situations in the first place through file locking is low-tech, easy to implement, and automatically works on all current and future file formats.
The problem was that the scene information was fundamentally visual (assets arranged in 3D space) so even a diffable text format wouldn't help you much. On the other hand, scenes are large enough that you often would want to work on them in parallel with other people.
I believe their first solution to that was the Asset Server that supported locking. But that still doesn't give two people the ability to work on a scene concurrently.
Eventually, some users went and developed a custom diff/merge tool to solve the problem.
https://discussions.unity.com/t/scene-diff-ease-your-sufferi...
I've never done exactly that but I have occasionally decided how information will be represented in a data file with merging in mind.
Storage is not storage, as you can store things as copies or as diffs (and a million other ways). For code, diffs are efficient, but for binaries, diffs approach double the size of the original files, so simply sending/storing the full file is better.
These differences have big effects on how git operates and many design choices assumed diffable text assets only.
If you do a little research, there's plenty of information on the topic.
Though I did work hard to remove any vendor lock-in by working directly with Atlassian and Microsoft pretty early in the process. It was a great working relationship, with a lot of help from Atlassian in particular on the file locking API. LFS shipped open source with compatible support in 3 separate git hosts.
With a good build system using a shared cache, it makes for a very pleasant development environment.
[1] https://stackoverflow.com/questions/2144406/how-to-make-shal...
Tying it all in with GitLab, Artifactory, CodeSonar, Anchore, etc.
Release artifacts like a .exe would NOT belong in Git because it is not source code.
Commits become multi-step, as you need to first commit the artifacts to get their artifact IDs to put in the repo. You can automate that via git hooks, but then you're back at where you started: git-lfs.
While git LFS is just a kludge for now, writing a filter argument during the clone operation is not the long-term solution either.
Git clone is the very first command most people will run when learning how to use git. Emphasized for effect: the very first command.
Will they remember to write the filter? Maybe, if the tutorial to the cool codebase they're trying to access mentions it. Maybe not. What happens if they don't? It may take a long time without any obvious indication. And if they do? The cloned repo might not be compilable/usable since the blobs are missing.
Say they do get it right. Will they understand it? Most likely not. We are exposing the inner workings of git on the very first command they learn. What's a blob? Why do I need to filter on it? Where are blobs stored? It's classic abstraction leakage.
This is a solved problem: Rsync does it. Just port the bloody implementation over. It does mean supporting alternative representations or moving away from blobs altogether, which git maintainers seem unwilling to do.
The first time you try to commit on a new install, git nags you to set your email address and name. I could see something similar happen the first time you clone a repo that hits the default global filter size, with instructions on how to disable it globally.
> The cloned repo might not be compilable/usable since the blobs are missing.
Maybe I misunderstood the article, but isn't the point of the filter to prevent downloading the full history of big files, and instead only check out the required version (like LFS does)?
So a filter of 1 byte will always give you a working tree, but checking out a prior commit will require fetching the filtered-out blobs that commit needs.
And yes, you can fix defaults without breaking backwards compatibility.
Not strictly true. They did change the default push behaviour from "matching" to "simple" in Git 2.0.
I agree with GP. The git community is very fond of doing checkbox fixes for team problems that aren’t or can’t be set as defaults and so require constant user intervention to work. See also some of the sparse checkout systems and adding notes to commits after the fact. They only work if you turn every pull and push into a flurry of activity. Which means they will never work from your IDE. Those are non fixes that pollute the space for actual fixes.
I’m not trying to argue that interface doesn’t matter. I use jq enough to be in that unfortunate category where I despise its interface. But it is difficult for me to imagine being similarly incapable in git.
Only the histories of the blobs are filtered out.
Can you explain what the solution is? I don't mean the details of the rsync algorithm, but rather what it would like like from the users' perspective. What files are on your local filesystem when you do a "git clone"?
This will sound absolutely insane, but maybe the source code for the video should be a script? Then the process of building produces a video which is a release artifact?
* https://github.com/LightArrowsEXE/Encoding-Projects
* https://github.com/Beatrice-Raws/encode-scripts
Based on my experience (YMMV), I think it is incorrect, yes, because any time I've performed a shallow clone of a repository, the saving wasn't as much as one would intuitively imagine (in other words: history is stored very efficiently).
The logical model of git is that it stores complete files. The physical model of git is that these complete files are stored as deltas within pack files (except for new files which haven't been packed yet; by default git automatically packs once there are too many of these loose files, and they're always packed in its network protocol when sending or receiving).
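You can see the physical side directly:

    git count-objects -v                                    # loose vs. packed object counts and sizes
    git verify-pack -v .git/objects/pack/pack-*.idx | head  # per-object size, size-in-pack, and delta depth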
It would be nice to have a VCS that could manage these more effectively but most binary formats don't lend themselves to that, even when it might be an additional layer to an image.
I reckon there's still room for better image and video formats that would work better with VCS.
Nothing wrong with "forgetting" to write the filter, and then if it's taking more than 10 minutes, write the filter.
Is this really the best we can do in terms of user experience? No. git needs to step up.
A beginner will follow instructions in a README: "Run git clone" or "run git clone --depth=1".
We have an open source CLI and server that mirrors git, but handles large files and mono repos with millions of files in a much more performant manner. Would love feedback if you want to check it out!
https://github.com/Oxen-AI/Oxen
Then the usual settings would be to shallow clone the latest content as well as fetch the full history and maybe the text file historical content. Ideally you could prune to the clone depth settings as well.
Why are we still talking about large file pointers? If you fix shallow and partial clones, then any repo can be an efficient file mirror, right?
> if I git clone a repo with many revisions of a noisome 25 MB PNG file
FYI ‘noisome’ is not a synonym for ‘noisy’ - it’s more of a synonym for ‘noxious’; it means something smells bad.
Does anyone have feedback about personally using DVC vs LFS?
Switched to https://github.com/kevin-hanselman/dud and I have been happy since.
For me, the deciding factor was that with LFS, if you want to delete objects from storage, you have to rewrite git history. At least, that's what both the Github and Gitlab docs specify.
DVC adds a layer of indirection, so that its structure is not directly tied to git. If I change my mind and delete the objects from S3, dvc might stop working, but git will be fine.
Some extra pluses about DVC:
- It can point to versioned S3 objects that you might already have as part of existing data pipelines.
- It integrates with the Python fsspec library to read the files on demand using paths like "dvc://path/to/file.parquet". This feels nicer than needing to download all the files up front.
https://github.com/Oxen-AI/Oxen
or check out the performance numbers https://docs.oxen.ai/features/performance
Any ideas why it isn’t more popular and more well known?
But it's not intended for or good at (without forcing a square peg into a round hole) the sort of thing LFS and promisors are for, which is a public project with binary assets.
git-annex is really for (and shines at) a private backup solution where you'd like to have N copies of some data around on various storage devices, track the history of each copy, ensure that you have at least N copies etc.
Each repository gets a UUID, and each tracked file has a SHA-256 hash. There's a branch which has a timestamp and repo UUID to SHA-256 mapping, if you have 10 repos that file will have (at least) 10 entries.
You can "trust" different repositories to different degrees, e.g. if you're storing a file on both some RAID'd storage server, or an old portable HD you're keeping in a desk drawer.
This really doesn't scale for a public project. E.g. I have a repository that I back up my photos and videos in, that repository has ~700 commits, and ~6000 commits to the metadata "git-annex" branch, pretty close to a 1:10 ratio.
There's an exhaustive history of every file movement that's ever occurred on the 10 storage devices I've ever used for that repository. Now imagine doing all that on a project used by more than one person.
All other solutions to tracking large files along with a git repository forgo all this complexity in favor of basically saying "just get the rest where you cloned me from, they'll have it!".
While git-annex works very well on Unix-style systems with Unix-style filesystems, it heavily depends on symbolic links, which do not exist on filesystems like exFAT, and are problematic on Windows (AFAIK, you have to be an administrator, or enable an obscure group policy). It has a degraded mode for these filesystems, but uses twice the disk space in that mode, and AFAIK loses some features.
(only half-trolling)
At the moment I'm using my own git server + git lfs deduplication using btrfs to efficiently handle the large files.
If large objects are just embedded in various packfiles this approach would no longer work, so I hope that such a behaviour can be controlled.
The old repo will still be pointed to whatever the LFS config was at that time. If that service is still up, it should continue to work.
In my case, with a 25GB repo, it was really detrimental to performance
To put it another way, regardless of what max size you give to --filter, you will end up with a complete git checkout, no missing files.
I’ve never used it for anything serious but my understanding is that Mercurial handles binary files better? Like it supports binary diffs if I understand correctly.
Any reason Git couldn’t get that?
Are there many binaries that people would store in git where this would actually help? I assume most files end up with compression or some other form of randomization between revisions making deduplication futile.
https://xethub.com/blog/benchmarking-the-modern-development-...
Editing the ID3 tag of an MP3 file or changing the rating metadata of an image will give a big advantage to block-level deduplication. Only a few such cases are needed to more than compensate for the worse-than-nothing inefficiency of binary diffs when there's nothing to deduplicate.
(It's been even longer since I used SVN in anger, but maybe it could work too. It has file locking, and local storage cost is proportional to the size of the head revision. It was manageable enough with a 2 GB head revision. Metadata access speed was always terrible though, which was tedious.)
But a good segmentation is only good for better compression and nicer diffs; git could do byte-wise diffs with no issues. So I wonder why git doesn't use customizable segmentation strategies where it calls external tools based on file type (e.g. a Rust thingy for Rust files, or a PNG thingy for PNG files).
At worst the tool would return either a single segment for the entire file or a byte-wise split, which would work anyway.
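Nothing like this exists in git today (per-type diff drivers and the `-delta` attribute in .gitattributes are arguably the closest hooks), but a pluggable segmentation interface could look something like this sketch, with PNG as the worked example; all the names here are hypothetical:

```python
# Hypothetical segmentation-driver interface (does not exist in git): a driver
# registered per file type returns segment end offsets, and the store could
# then delta/deduplicate per segment. Unknown types fall back to one segment.
from typing import Callable

Segmenter = Callable[[bytes], list[int]]  # returns end offsets of segments

def whole_file(data: bytes) -> list[int]:
    return [len(data)]  # fallback: the entire file as a single segment

def png_chunks(data: bytes) -> list[int]:
    """Split a PNG at its chunk boundaries (8-byte signature, then repeated
    length/type/data/CRC records), so an edit to one chunk leaves the rest
    byte-identical."""
    offsets, pos = [], 8  # skip the PNG signature
    while pos + 8 <= len(data):
        length = int.from_bytes(data[pos:pos + 4], "big")
        pos += 4 + 4 + length + 4  # length + type + data + CRC
        offsets.append(min(pos, len(data)))
    if not offsets or offsets[-1] != len(data):
        offsets.append(len(data))
    return offsets

drivers: dict[str, Segmenter] = {".png": png_chunks}

def segment(path: str, data: bytes) -> list[int]:
    ext = "." + path.rsplit(".", 1)[-1] if "." in path else ""
    return drivers.get(ext, whole_file)(data)
```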
All deltas between versions are binary diffs.
Git has always handled large (including large binary) files just fine.
What it doesn't like is files where a conceptually minor change changes the entire file, for example compressed or encrypted files.
The only somewhat valid complaint is that if someone once committed a large file and then it was later deleted (maybe minutes later, maybe years later) then it is in the repo and in everyone's checkouts forever. Which applies equally to small and to large files, but large ones have more impact.
That's the whole point of a version control system. To preserve the history, allowing earlier versions to be recreated.
The better solution would be to have better review of changes pushed to the master repo, including having unreviewed changes in separate, potentially sacrificial, repos until approved.
There's a bunch of binary files that change a lot on small changes due to compression or how the data is serialised, so the problem doesn't go away completely. One could conceivably start handling that, but there are lots of file formats out there, and the sum of that complexity tends to become bugs and security issues.
Another alternative would be storing the chunks as blobs so that you reconstruct the full binary and only have to store the changed chunks. However that doesn't work with compressed binaries.
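A quick throwaway illustration of why compression defeats this kind of chunk sharing: compress two inputs that differ by a single byte and count how much of the output still lines up.

```python
# A one-byte change near the start of the input perturbs the compressor's
# matches and Huffman tables, so the compressed streams diverge almost
# immediately and block-level dedup finds very little to share.
import zlib

original = b"the quick brown fox jumps over the lazy dog\n" * 2000
modified = b"X" + original[1:]  # flip a single byte

ca, cb = zlib.compress(original), zlib.compress(modified)
same = sum(x == y for x, y in zip(ca, cb))
print(f"{same} of {min(len(ca), len(cb))} compressed bytes match positionally")
```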
The real problem is that Git wants you to have a full copy of all files that have ever existed in the repo. As soon as you add a large file to a repo it's there forever and can basically never be removed. If you keep editing it you'll build up lots more permanent data in the repo.
Git is really missing:
1. A way to delete old data.
2. A way for the repo to indicate which data is probably not needed (old large binaries).
3. A way to serve large files efficiently (from a CDN).
Some of these can sort of be done, but it's super janky. You have to proactively add confusing flags etc.
Prolly trees are very similar to Merkle trees or the rsync algorithm, but they support mutation and version history retention with some nice properties. For example: you always obtain exactly the same tree (with the same root hash) irrespective of the order of incremental edit operations used to get to the same state.
In other words, two users could edit a subset of a 1 TB file, both could merge their edits, and both will then agree on the root hash without having to re-hash or even download the entire file!
Another major advantage on modern many-core CPUs is that Prolly trees can be constructed in parallel instead of having to be streamed sequentially on one thread.
Then the really big brained move is to store the entire SCM repo as a single Prolly tree for efficient incremental downloads, merges, or whatever. I.e.: a repo fork could share storage with the original not just up to the point-in-time of the fork, but all future changes too.
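For the curious, a toy sketch of the content-defined boundary trick that gives that edit-order independence (heavily simplified, not any particular implementation; real prolly trees apply the same chunking recursively to the levels of hashes as well):

```python
import hashlib

def chunks(data: bytes, window: int = 16, mask: int = 0x3FF) -> list[bytes]:
    """Cut a boundary wherever a hash of the last few bytes matches a pattern
    (about one boundary per 1024 bytes here). Boundaries depend only on local
    content, so an edit only reshapes the chunks immediately around it."""
    out, start = [], 0
    for i in range(len(data)):
        h = hashlib.blake2b(data[max(0, i - window):i + 1], digest_size=4).digest()
        if int.from_bytes(h, "big") & mask == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

def root_hash(data: bytes) -> str:
    """One level of a prolly-tree-like index: hash the leaves, then hash the
    list of leaf hashes. Any sequence of edits that ends at the same bytes
    yields the same root, and untouched leaves keep their hashes, so an editor
    that tracks which byte ranges it changed only rehashes the chunks nearby."""
    leaves = [hashlib.sha256(c).digest() for c in chunks(data)]
    return hashlib.sha256(b"".join(leaves)).hexdigest()
```

In the two-editor example above, both sides would only exchange the leaves whose hashes changed and would still agree on the root hash afterwards.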
If there’s a new algorithm out there that warrants a look…
We have also talked about doing something similar for tree objects in order to better support very large directories (to reduce the amount of data we need to transfer for them) and very deep directories (to reduce the number of roundtrips to the server). I think we have only talked about that on Discord so far (https://discord.com/channels/968932220549103686/969291218347...). It would not be compatible with Git repos, so it would only really be useful to teams outside Google once there's an external jj-native forge that decides to support it (if our rough design is even realistic).
Maybe we could fork something like Codeberg's UI but use jj, or maybe the Jujutsu team could work with Codeberg itself? I'm pretty sure the Codeberg team is really nice, and this could be an experimental feature which, if it really needs to be, could be crowdfunded by the community.
I will chip in the first dollar.
And yes you can represent a whole repo as a giant tar file, but because the boundaries between hash segments won't line up with your file boundaries you get an efficiency hit with very little benefit. Unless you make it file-aware in which case it ends up even closer to what git already does.
Git knows how to store deltas between files. Making that mechanism more reliable is probably able to achieve more with less.
One of our projects has a UI editor with a 60MB file for nearly everything except images, and people work on different UI flows at the same time.
Merging would require support from the DB engine, however.
That was actually an initial selling point of git: you have the full history locally. You can work from the plane/train/deserted island just fine.
These large files will persist in the repo forever. So people look for options to segregate large files out so that they only get downloaded on demand (aka "lazily").
All the existing options (submodules, LFS, partial clones) are different answers to "how do we make certain files only download on demand"
Don't forget sparse checkouts!
Nice to see some Microsoft and Google emails contributing.
Google has Android and Chromium, as well as Git+Gerrit-on-Borg https://opensource.google/documentation/reference/releasing/...
These new features are pretty awesome too, especially separate large object remotes. They will probably enable git to be used for even more things than it already is, and open up new ways of working with git.
Then you can clone without checking out all the unnecessary large files to get a working build. This also helps on the legal side to correctly license your repos.
I'm struggling to see how this is a problem with git and not just antipatterns that arise from badly organized projects.
I just want my files to match what's expected when I pull a commit; that shouldn't require some literal "commit build system" and "pull build system". Coming from Perforce and SVN, I can't comprehend why git is so popular, beyond cargo cult. It's completely nonsensical to think that software is just source.
Write your code/tools as if they will be used at 2:00 am while the server room is on fire. Because sooner or later they will be.
A lot of our processes are used like emergency procedures. Emergency procedures are meant to be brainless as much as possible. So you can reserve the rest of your capacity for the actual problem. My version essentially calls out Kernighan’s Law.
Git does the responsible thing and lets the user determine how to proceed with the mess they've made.
I must say I'm increasingly suspicious of the hate that git receives these days.
A good version control system would support petabyte scale history and terabyte scale clones via sparse virtual filesystem.
Git’s design is just bad for almost all projects that aren’t Linux.
(I know this will get downvoted. But most modern programmers have never used anything but Git and so they don’t realize their tool is actually quite bad! It’s a shame.)
I like this idea in principle, but I always wonder what it would look like in practice outside a FAANG company: how do you ensure the virtual file system works equally well on all platforms, without root access, possibly even inside containers? How do you ensure it's fast? What do you do in case of network errors?
Tom Lyon: NFS Must Die! From NLUUG 2024:
https://www.youtube.com/watch?v=ZVF_djcccKc
>Why NFS must die, and how to get Beyond File Sharing in the cloud.
Slides:
https://nluug.nl/bestanden/presentaties/2024-05-21-tom-lyon-...
Eminent Sun alumnus says NFS must die:
https://blocksandfiles.com/2024/06/17/eminent-sun-alumnus-sa...
Re network errors. How many things break when GitHub is down? Quite a lot! This isn’t particularly special. Prefetch and clone are the same operation.
But most people don't need most of its features and many people need features it doesn't have.
If you look up git worktrees, you'll find a lot of blog articles referring to worktrees as a "secret weapon" or similar. So git's secret weapon is a mode that lets you work around the ugliness of branches. This suggests that many people would be better suited by an SCM that isn't branch-based.
It's nice having the full history offline. But the scaling problems force people to adopt a workflow where they have a large number of small git repos instead of keeping the history of related things together. I think there are better designs out there for the typical open source project.
In my experience, branches are totally awesome. Worktrees make branches even more awesome because they let me check out multiple branches at once to separate directories.
The only way it could get better is if it somehow gains the ability to check out the same branch to multiple different directories at once.
But I encourage everyone to try out a few alternatives (and adopt their workflows at least for a while). I have no idea if you have or not.
But if one has never used the alternatives, one doesn't really know just how nice things can be. Or, even if you still find git to be your preferred tool, having an alternative experience can open you to other possibilities and ways of working.
Just like everyone should try a couple of different programming languages or editors or anything else for size. You may not end up choosing it, but seeing the possibilities and different ways of thinking is a very good thing.
This is only possible because git is decentralized. Claiming that git is centralized is a complete falsehood.
Developers have their own drama
Still the approach is to put code and data in a folder and call it a day. Slap a "_FINAL" at the folder name and you are golden.
We found that we could move the large files to Artifactory as it has Git LFS support.
But the problem was the entire history that did not have Artifactory pointers. Every clone included the large files (for some reason the filter functionality wouldn't work for us - it was a large repo and it had hundreds of users, amongst other problems).
Anyways what we ended up doing was closing that repo and opening a new one with the large files stripped.
Nitpick on the author's page:
"Nowadays, there's a free tier, but you're dependent on the whims of GitHub to set pricing. Today, a 50GB repo on GitHub will cost $40/year for storage"
This is not true, as you don't need GitHub to get LFS support.