I understand the decision to archive the upstream repo; as of when I left Meta, we (i.e. the Jemalloc team) weren’t really in a great place to respond to all the random GitHub issues people would file (my favorite was the time someone filed an issue because our test suite didn’t pass on Itanium lol). Still, it makes me sad to see. Jemalloc is still IMO the best-performing general-purpose malloc implementation that’s easily usable; TCMalloc is great, but is an absolute nightmare to use if you’re not using bazel (this has become slightly less true now that bazel 7.4.0 added cc_static_library so at least you can somewhat easily export a static library, but broadly speaking the point still stands).
I’ve been meaning to ask Qi if he’d be open to cutting a final 6.0 release on the repo before re-archiving.
At the same time it’d be nice to modernize the default settings for the final release. Disabling the (somewhat confusingly backwardly-named) “cache oblivious” setting by default so that the 16 KiB size-class isn’t bloated to 20 KiB would be a major improvement. This isn’t to disparage your (i.e. Jason’s) original choice here; IIRC when I last talked to Qi and David about this they made the point that at the time you chose this default, typical TLB associativity was much lower than it is now. On a similar note, increasing the default “page size” from 4 KiB to something larger (probably 16 KiB), which would correspondingly increase the large size-class cutoff (i.e. the point at which the allocator switches from placing multiple allocations onto a slab, to backing individual allocations with their own extent directly) from 16 KiB up to 64 KiB would be pretty impactful. One of the last things I looked at before leaving Meta was making this change internally for major services, as it was worth a several percent CPU improvement (at the cost of a minor increase in RAM usage due to increased fragmentation). There’s a few other things I’d tweak (e.g. switching the default setting of metadata_thp from “disabled” to “auto”, changing the extent-sizing for slabs from using the nearest exact multiple of the page size that fits the size-class to instead allowing ~1% guaranteed wasted space in exchange for reducing fragmentation), but the aforementioned settings are the biggest ones.
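For anyone wanting to experiment, the settings above map onto jemalloc's build-time and runtime knobs roughly like this (a sketch; `--with-lg-page` and `--disable-cache-oblivious` are real jemalloc 5.x configure flags and `metadata_thp` is a real runtime option, but treat the exact combination as illustrative rather than a recommended configuration):

```sh
# Build-time: a larger "page" size (16 KiB => lg 14) and no cache-oblivious
# slab layout (jemalloc 5.3 also exposes the latter as a runtime option).
./configure --with-lg-page=14 --disable-cache-oblivious
make && make install

# Runtime: enable transparent huge pages for allocator metadata.
export MALLOC_CONF="metadata_thp:auto"
```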
matoro · 16h ago
That was me that filed the Itanium test suite failure. :)
apaprocki · 14h ago
Ah, porting to HP Superdome servers. It’s like being handed a brochure describing the intricate details of the iceberg the ship you just boarded is about to hit in a few days.
A fellow traveler, ahoy!
boulos · 15h ago
The Itanic was kind of great :). I'm convinced it helped sink SGI.
crest · 14m ago
Itanium did its most important job: it killed everything but ARM and POWER.
froh · 15h ago
Sunk by the Great Itanic ?
acdha · 8h ago
SGI and HP! Intel should have a statue of Rick Belluzzo on their campus.
sitkack · 11h ago
Why was the sinking of SGI great?
boulos · 4h ago
Oh, that wasn't the intent. I meant two separate things. The Itanic itself was kind of fascinating, but mostly panned (hence the nickname).
SGI's decision to build out Itanium systems may have helped precipitate their own downfall. That was sad.
kstrauser · 16h ago
Stuff like this is what keeps me coming back here. Thanks for posting this!
What's hard about using TCMalloc if you're not using bazel? (Not asking to imply that it's not, but because I'm genuinely curious.)
Svetlitski · 16h ago
It’s just a huge pain to build and link against. Before the bazel 7.4.0 change your options were basically:
1. Use it as a dynamically linked library. This is not great because you’re taking at a minimum the performance hit of going through the PLT for every call. The forfeited performance is even larger if you compare against statically linking with LTO (i.e. so that you can inline calls to malloc, get the benefit of FDO, etc.). Not to mention all the deployment headaches associated with shared libraries.
2. Painfully manually create a static library. I’ve done this, it’s awful; especially if you want to go the extra mile to capture as much performance as possible and at least get partial LTO (i.e. of TCMalloc independent of your application code, compiling all of TCMalloc’s compilation units together to create a single object file).
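For context, the "manually create a static library" route looks roughly like the following (a sketch, not TCMalloc's supported build; the source layout is an assumption, and `gcc-ar` is used so the archive keeps LTO bitcode usable at final link):

```sh
# Compile every TCMalloc TU (plus its Abseil dependencies) with LTO bitcode.
g++ -O2 -flto -c tcmalloc/*.cc

# Archive with the LTO-aware ar wrapper so the bitcode survives.
gcc-ar rcs libtcmalloc.a *.o

# Linking the application with -flto can then inline malloc/free into it.
g++ -O2 -flto main.o -L. -ltcmalloc -lpthread
```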
When I was at Meta I imported TCMalloc to benchmark against (to highlight areas where we could do better in Jemalloc) by painstakingly hand-translating its bazel BUILD files to buck2 because there was legitimately no better option.
As a consequence of being so hard to use outside of Google, TCMalloc has many more unexpected (sometimes problematic) behaviors than Jemalloc when used as a general purpose allocator in other environments (e.g. it basically assumes that you are using a certain set of Linux configuration options [1] and behaves rather poorly if you’re not).
As I observed when I was at Google: tcmalloc didn't have a dedicated team; it was a project driven by server performance optimization engineers aiming to improve the performance of important internal servers.
Extracting it to github.com/google/tcmalloc was complex due to intricate dependencies (https://abseil.io/blog/20200212-tcmalloc ). As internal performance priorities demanded more focus, less time was available for maintaining the CMake build system.
Maintaining the repo could at best be described as a community contribution activity.
> Meta’s needs stopped aligning well with those of external uses some time ago, and they are better off doing their own thing.
I think Google's diverged from external uses even longer ago :) (For a long time the google3 and gperftools tcmalloc implementations were very different.)
mort96 · 11h ago
Everything from Google is an absolute pain to work with unless you're in Google using their systems, FWIW. Anything from the Chromium project is deeply entangled with everything else from the Chromium project as part of one gigantic Chromium source tree with all dependencies and toolchains vendored. They do not care about ABI whatsoever, to the point that a lot of Google libraries change their public ABI based on whether address sanitizer is enabled or not, meaning you can't enable ASAN for your code if you use pre-built (e.g. package-manager-provided) versions of their code. Their libraries also tend to break if you link against them from a project with RTTI enabled, a compiler set to a slightly different compiler version, or any number of other minute differences that most other developers don't let affect their ABI.
And if you try to build their libraries from source, that involves downloading tens of gigabytes of sysroots and toolchains and vendored dependencies.
Oh and you probably don't want multiple versions of a library in your binary, so be prepared to use Google's (probably outdated) version of whatever libraries they vendor.
And they make no effort whatsoever to distinguish between public header files and their source code, so if you wanna package up their libraries, be prepared to make scripts to extract the headers you need (including headers from vendored dependencies); you can't just copy all of some 'include/' folder.
And their public headers tend to do idiotic stuff like `#include "base/pc.h"`, where that `"base/pc.h"` path is not relative to the file doing the include. So you're gonna have to pollute the include namespace. Make sure not to step on their toes! There's a lot of them.
I have had the misfortune of working with Abseil, their WebRTC library, their gRPC library and their protobuf library, and it's all terrible. For personal projects where I don't have a very, very good reason to use Google code, I try to avoid it like the plague. For professional projects where I've had to use libwebrtc, the only reasonable approach is to silo off libwebrtc into its own binary which only deals with WebRTC, typically with a line-delimited JSON protocol on stdin/stdout. For things like protobuf/gRPC where that hasn't been possible, you just have to live with the suffering.
..This comment should probably have been a blog post.
ahartmetz · 11h ago
I think your rant isn't long enough to include everything relevant ;)
The Blink web engine (which I sometimes compile for qtwebengine) takes a really long time to compile, several times longer than Gecko according to some info I found online. Google has a policy of not using forward declarations, including everything instead. That's a pretty big WTF for anyone who has ever optimized build time. Google probably just throws hardware and (distributed) caching at the problem, not giving a shit about anyone else building it. Oh, it also needs about 2 GB of RAM per build thread - basically nothing else does.
LtdJorge · 10h ago
Even with Firefox using Rust and requiring a build of many crates, qtwebengine takes more time. It was so bad that I had to remove packages from my system (Gentoo) that were pulling in qtwebengine.
And I build all Rust crates (including rustc) with -O3, same as C/C++.
That is really nice to hear, but AFAICS it only means that it may change in the future. Because in current code, it was ~all includes last time I checked.
Well, I remember one - very biased - example where I had a look at a class that was especially expensive to compile, like 40 seconds (on a Ryzen 7950X) and maybe 2 GB of RAM. It had under 200 LOC and didn't seem to do anything that's typically expensive to compile... except for the stuff it included. Which also didn't seem to do anything fancy. But transitive includes can snowball if you don't add any "compile firewalls".
bialpio · 42m ago
> Because in current code, it was ~all includes last time I checked.
That's another matter - just because forward-declares are allowed, doesn't mean they are mandated, but in my experience the reviewers were paying attention to that pretty well.
I picked couple random headers from the directory where I've contributed the most to blink, and from what I'm seeing, most of the classes that could be forward-declared, were. I have not looked at .cc files given that those tend to need to see the declaration (except when it's unused, but then why have a forward-decl at all?) or the compiler would complain about access into incomplete type.
> Well, I remember one - very biased - example where I had a look at a class that was especially expensive to compile, like 40 seconds (on a Ryzen 7950X) and maybe 2 GB of RAM. It had under 200 LOC and didn't seem to do anything that's typically expensive to compile... except for the stuff it included.
Maybe the stuff was actually being compiled because of some member in a class (so it was actually expensive to compile). Or maybe you stumbled upon a place where folks weren't paying attention. Hard to say without a concrete example. The "compile firewall" was added pretty recently I think, but I don't know if it's going to block anything from landing.
Edit: formatting (switched bulleted list into comma-separated because clearly I don't know how to format it).
The annotated red dots correspond to the last time Chrome developers did a big push to prune the include graph to optimize build time. It was effective, but there was push back. C++ developers just want magic, they don't want to think about dependency management, and it's hard to blame them. But, at the end of the day, builds scale with sources times dependencies, and if you aren't disciplined, you can expect superlinear build times.
ahartmetz · 18m ago
Good that it's being tracked, but Jesus, these numbers!
110 CPU hours for a build. (Fortunately, it seems to be a little over half that for my CPU. "Cloud CPUs" are kinda slow.)
I picked the 5001st largest file with includes. It's zoom_view_controller.cc, 140 lines in the .cc file, size with includes: 19.5 MB.
Initially I picked the 5000th largest file with includes, but for devtools_target_ui.cc, I see a bit more legitimacy for having lots of includes. It has 384 "own" lines in the .cc file and, of course, also about 19.5 MB size with includes.
A C++20 source file including some standard library headers easily bloats to a little under 1 MB IIRC, and that's already kind of unreasonable. 20x of that is very unreasonable.
I don't think that I need to tell anyone on the Chrome team how to improve performance in software: you measure and then you tackle the dumb low-hanging fruit first. From these results, it doesn't seem like anyone is working with the actual goal of improving the situation, so long as the guidelines are followed on paper.
fc417fc802 · 10h ago
Reading this perspective was interesting. I can appreciate that things didn't fit into your workflow very well, but my experience has been the opposite. Their projects seem to be structured from the perspective of building literally everything from source on the spot. That matches my mindset - I choose to build from scratch in a network isolated environment. As a result google repos are some of the few that I can count on to be fairly easy to get up and running. An alarming number of projects apparently haven't been tested under such conditions and I'm forced to spend hours patching up cmake scripts. (Even worse are the projects that require 'npm install' as part of the build process. Absurdity.)
> Oh and you probably don't want multiple versions of a library in your binary, so be prepared to use Google's (probably outdated) version of whatever libraries they vendor.
This is the only complaint I can relate to. Sometimes they lag on rolling dependencies forward. Not so infrequently there are minor (or not so minor) issues when I try to do so myself and I don't want to waste time patching my dependencies up so I get stuck for a while until they get around to it. That said, usually rolling forward works without issue.
> if you try to build their libraries from source, that involves downloading tens of gigabytes of sysroots and toolchains and vendored dependencies.
Out of curiosity which project did you run into this with? That said, isn't the only alternative for them moving to something like nix? Otherwise how do you tightly specify the build environment?
mort96 · 5h ago
I don't really have the care nor time to respond as thoroughly as you deserve, but here are some thoughts:
> Out of curiosity which project did you run into this with?
Their WebRTC library for the most part, but also the gRPC C++ library. Unlike WebRTC, grpc++ is in most package managers so the need to build it myself is less, but WebRTC is a behemoth and not in any package manager.
> That said, isn't the only alternative for them moving to something like nix? Otherwise how do you tightly specify the build environment?
I don't expect my libraries to tightly specify the build environment. I expect my libraries to conform to my software's build environment, to use versions of other libraries that I provide to it, etc etc. I don't mind that Google builds their application software the way they do, Google Chrome should tightly constrain its build environment if Google wants; but their libraries should fit in to my environment.
I'm wondering, what is your relationship with Google software that you build from source? Are you building their libraries to integrate with your own applications, or do you just build Google's applications from source and use them as-is?
bluGill · 6h ago
> I choose to build from scratch in a network isolated environment. As a result google repos are some of the few that I can count on to be fairly easy to get up and running.
If you are building a single Google project, they are easy to get up and running. If you are building your own project on top of theirs, things get difficult; those library issues will get you.
I don't know about OP, but we have our own in house package manager. If Conan was ready a couple years sooner we would have used that instead.
rstat1 · 2h ago
I agree to a point. grpc++ (and protobuf and boringssl and abseil and....) was the biggest pain in the ass to integrate into a personal project I've ever seen. I ended up having to write a custom tool to convert their Bazel files to the format my projects tend to use (GN and Ninja). Many hours wasted. There were no library-specific "sysroots" or "toolchains" involved though, thankfully, because I'm sure that would have made things even worse.
Upside is (I guess) if I ever want to use grpc in another project the work's already done and it'll just be a matter of copy/paste.
pavlov · 11h ago
This matches my own experience trying to use Google's C++ open source. You should write the blog post!
ewalk153 · 8h ago
I’ve hit similar problems with their Ruby gRPC library.
The counterexample is the language Go. The team running Go has put considerable care and attention into making the project welcoming for developers to contribute to, while still adhering to Google code contribution requirements. Building from source is straightforward and IIRC it's one of the easier cross compilers to set up.
Go is kind of a pain to build from source. Build one version to build another, and another..
Or rather it was the last time I tried.
rfoo · 11h ago
> they make no effort what so ever to distinguish between public header files and their source code
They did, in a different way. The world is used to distinguishing by convention, putting them in different directory hierarchies (src/, include/). google3 depends on the build system to do so; "which header file is public" is documented in BUILD files. You are then required to use their build system to grasp the difference :(
> And their public headers tend to do idiotic stuff like `#include "base/pc.h"`, where that `"base/pc.h"` path is not relative to the file doing the include.
I have to disagree on this one. Relying on relative include paths suck. Just having one `-I/project/root` is the way to go.
mort96 · 11h ago
> I have to disagree on this one. Relying on relative include paths suck. Just having one `-I/project/root` is the way to go.
Oh to be clear, I'm not saying that they should've used relative includes. I'm complaining that they don't put their includes in their own namespace. If public headers were in a folder called `include/webrtc` as is the typical convention, and they all contained `#include <webrtc/base/pc.h>` or `#include "webrtc/base/pc.h"` I would've had no problem. But as it is, WebRTC's headers are in include paths which it's really difficult to avoid colliding with. You'll cause collisions if your project has a source directory called `api`, or `pc`, or `net`, or `media`, or a whole host of other common names.
rfoo · 8h ago
Thanks for the clarification. Yeah, that's pretty frustrating.
Now I'm curious why grpc, webrtc and some other Chromium repos were set up like this.
Google projects which started in google3 and were later exported as open source projects don't have this defect, for example tensorflow, abseil, etc. They all have a top-level directory containing all their code, so includes become `#include "tensorflow/...`.
Feels like a weird collision of coding style and starting a project outside of their monorepo
alextingle · 10h ago
>> `#include "base/pc.h"`, where that `"base/pc.h"` path is not relative to the file doing the include.
> I have to disagree on this one.
The double-quotes literally mean "this dependency is relative to the current file". If you want to depend on a -I, then signal that by using angle brackets.
mort96 · 9h ago
Eh, no. The quotes mean "this is not a dependency on a system library". Quote includes can resolve relative to the including file, or relative to directories specified with -I. The only thing they can't do is include things relative to directories specified with -isystem or the system include directories.
I would be surprised if I read some project's code where angle brackets are used to include headers from within the same project. I'm not surprised when quotes are used to include code from within the project but relative to the project's root.
kstrauser · 16h ago
Wow. That does sound quite unpleasant.
Thanks again. This is far outside my regular work, but it fascinates me.
prpl · 15h ago
I’ve successfully used LLMs to migrate Makefiles to bazel, more or less. I’ve not tried the reverse but suspect (2) isn’t so bad these days. YMMV, of course, but food for thought
benced · 2h ago
Yep I've done something similar. This is the only way I managed to compile Google's C++ S2 library (spatial indexing) which depends on absl and OpenSSL.
(I managed to avoid infecting my project with boringSSL)
rfoo · 11h ago
Dunno why you got downvoted, but I've also tried to let Claude translate a bunch of BUILD files to equivalent CMakeLists.txt. It worked. The resulting CMakeLists.txt looks super terrible, but so is 95% of CMakeLists.txt in this world, so why bother, it's doomed anyway.
mort96 · 11h ago
They got downvoted because 1) comments of the form "I gave a chat bot a toy example of a task and it managed it" are tired and uninformative, and 2) nobody was talking about anything which would make translating a Makefile into Bazel relevant: nobody here has a Makefile we wish was Bazel; we wish Google code was easier to work with.
prpl · 45m ago
People are discussing things that are tedious work. I think the conversion to Bazel from a makefile is much more tedious and error prone than the reverse, in part because of Bazel sandboxing although that shouldn’t make much of a difference for a well-defined collection of Makefiles of a C library.
The reverse should be much easier, which was the point of the post. Pointing out a capability (translation of build systems) that is handled well is, well, informative. The future isn't evenly distributed and people aren't always aware of capabilities, even on HN.
jeffbee · 7h ago
The person above was saying they did a tedious manual port of tcmalloc to buck. Since tcmalloc provides both bazel and cmake builds, it seems relevant that in these days a person could have potentially forced a robot to do the job of writing the buck file given the cmake or bazel files.
gazpacho · 15h ago
I would love to see these changes - or even some sort of blog post or extended documentation explaining the rationale. As it is, the docs are somewhat barren. I feel there's a lot of knowledge that folks like you have right now, from all of the work that was done internally at Meta, that would be best shared now before it is lost.
Thaxll · 57m ago
It's kind of wild that great software is hindered by a complicated build and integration process.
klabb3 · 7h ago
> we (i.e. the Jemalloc team) weren’t really in a great place to respond to all the random GitHub issues people would file
Why not? I mean this is complete drive-by comment, so please correct me, but there was a fully staffed team at Meta that maintained it, but was not in the best place to manage the issues?
anonymoushn · 9m ago
Well, to be blunt, the company does not care about this, so it does not get done.
xcrjm · 5h ago
They said the team was not in a great place to do it, e.g. they probably had competing priorities that overshadowed triaging issues.
EnPissant · 15h ago
Do you have any opinions on mimalloc?
einpoklum · 11h ago
> TCMalloc is great, but is an absolute nightmare to use if you’re not using bazel
custom-malloc-newbie question: Why is the choice of build system (generator) significant when evaluating the usability of a library?
fc417fc802 · 10h ago
Because you need to build it to use it, and you likely already have significant build related infrastructure, and you are going to need to integrate any new dependencies into that. I'm increasingly convinced that the various build systems are elaborate and wildly successful ploys intended only to sap developer time and energy.
CamouflagedKiwi · 2h ago
Because you have to build it. If they don't use the same build system as you, you either want to invoke their system, or import it into yours. The former is unappealing if it's 'heavy' or doesn't play well as a subprocess; the latter can take a lot of time if the build process you're replicating is complex.
I've done both before, and seen libraries at various levels of complexity; there is definitely a point where you just want to give up and not use the thing when it's very complex.
adityapatadia · 13h ago
Jason, here is a story about how much your work impacts us.
We run a decently sized company that processes hundreds of millions of images/videos per day. When we first started about 5 years ago, we spent countless hours debugging issues related to memory fragmentation.
One fine day we discovered Jemalloc and put it into the service that was causing a lot of memory fragmentation. We did not think that those 2 lines of changes in a Dockerfile were going to fix all of our woes, but we were pleasantly surprised. Every single issue went away.
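(For readers wondering what such a change looks like: on a Debian-based image it's typically something like the following; the package name and library path are assumptions that vary by distro and architecture.)

```dockerfile
RUN apt-get update && apt-get install -y libjemalloc2
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
```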
Today, our multi-million dollar revenue company is using your memory allocator on every single service and on every single Dockerfile.
Thank you! From the bottom of our hearts!
laszlojamf · 12h ago
I really don't mean to be snarky, but honest question:
Did you donate? Nothing says thank you like some $$$...
onli · 11h ago
It was a meta project and development ceased. For a regular project that expectation is fine, but here it does not apply IMHO.
adityapatadia · 5h ago
We regularly donate to projects via Open Collective. We frankly did not consider it here, due to the FB involvement I think.
thewisenerd · 12h ago
indeed! most image processing golang services suggest/use jemalloc
Interesting that one of the factors listed there, the hardcoded page size on arm64, is still an unsolved issue upstream, and that it forces app developers to either ship multiple arm64 Linux binaries or drop support for some platforms.
I wonder if some kind of dynamic page-size (with dynamic ftrace-style binary patching for performance?) would have been that much slower.
pkhuong · 4h ago
You can run jemalloc configured with 16KB pages on a 4KB page system.
dazzawazza · 9h ago
I've used jemalloc in every game engine I've written for years. It's just the thing to do. WAY faster on win32 than the default allocator. It's also nice to have the same allocator across all platforms.
I learned of it from its integration in FreeBSD and never looked back.
jemalloc has helped entertain a lot of people :)
Iwan-Zotow · 2h ago
+1
windows def allocator is pos. Jemalloc rules
int_19h · 1m ago
> windows def allocator
Which one of them? These days it could mean HeapAlloc, or it could mean malloc from uCRT.
ahartmetz · 44m ago
>windows def allocator is pos
Wow, still? I remember allocator benchmarks from 10-15 years ago where there were some notable differences between allocators... and then Windows with like 20% the performance of everything else!
chubot · 17h ago
Nice post -- so does Facebook no longer use jemalloc at all? Or is it maintenance mode?
Or I wonder if they could simply use tcmalloc or another allocator these days?
Facebook infrastructure engineering reduced investment in core technology, instead emphasizing return on investment.
Svetlitski · 15h ago
As of when I left Meta nearly two years ago (although I would be absolutely shocked if this isn’t still the case) Jemalloc is the allocator, and is statically linked into every single binary running at the company.
> Or I wonder if they could simply use tcmalloc or another allocator these days?
Jemalloc is very deeply integrated there, so this is a lot harder than it sounds. From the telemetry being plumbed through in Strobelight, to applications using every highly Jemalloc-specific extension under the sun (e.g. manually created arenas with custom extent hooks), to the convergent evolution of applications being written in ways such that they perform optimally with respect to Jemalloc’s exact behavior.
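As an illustration of how deep that integration can go, the non-standard jemalloc API for manually created arenas looks roughly like this (a sketch; it requires building and linking against jemalloc itself, so it won't compile against a stock libc, and a real custom-extent-hooks setup would pass an `extent_hooks_t *` through `newp`):

```c
#include <stdio.h>
#include <jemalloc/jemalloc.h>

int main(void) {
    /* Create a fresh arena; custom extent hooks could be supplied via newp. */
    unsigned arena;
    size_t sz = sizeof(arena);
    if (mallctl("arenas.create", &arena, &sz, NULL, 0) != 0) {
        fprintf(stderr, "arenas.create failed\n");
        return 1;
    }

    /* Allocate from that arena specifically, bypassing the thread cache. */
    void *p = mallocx(1 << 20, MALLOCX_ARENA(arena) | MALLOCX_TCACHE_NONE);
    dallocx(p, MALLOCX_TCACHE_NONE);
    printf("allocated from arena %u\n", arena);
    return 0;
}
```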
anonymoushn · 14h ago
The big recent change is that jemalloc no longer has any of its previous long-term maintainers. But it is receiving more attention from Facebook than it has in a long time, and I am somewhat optimistic that, after some recent drama in which some of that attention was aimed in a counterproductive direction, the company can aim the rest of it in directions that Qi and Jason would agree with, and that are well aligned with the needs of external users.
charcircuit · 15h ago
Meta has a fork that they still are working on, where development is continuing.
The point of the blog post is that repo is over-focused on Facebook's needs instead of "general utility":
> as a result of recent changes within Meta we no longer have anyone shepherding long-term jemalloc development with an eye toward general utility
> we reached a sad end for jemalloc in the hands of Facebook/Meta
> Meta’s needs stopped aligning well with those of external uses some time ago, and they are better off doing their own thing.
nh2 · 7h ago
But I'd like to know exactly what that means.
How can I find out if Facebook's focus is aligned with my own needs?
burnt-resistor · 8h ago
They take everything FLOSS and ruin it with bureaucracy, churn, breakage, and inconsideration to external use. They may claim FOSS broadly but it's mostly FOSS-washed, unusable garbage except for a few popular things.
umanwizard · 2h ago
React, PyTorch, and RocksDB are all extremely significant. Not to mention them being one of the biggest contributors to the Linux kernel.
wiz21c · 12h ago
FTA:
> And people find themselves in impossible situations where the main choices are 1) make poor decisions under extreme pressure, 2) comply under extreme pressure, or 3) get routed around.
It doesn't sound like a work place :-(
bravetraveler · 11h ago
Sounds like every workplace I've 'enjoyed' since ~2008
throwaway314155 · 9h ago
nice username
- fsociety
mrweasel · 8h ago
Now I'm not one for victim blaming, but if that's more than three places of employment, maybe you need to rethink the positions you apply for.
acdha · 6h ago
There’s something to that but it is victim blaming if you’re not acknowledging the larger trends. There are a lot of places whose MBAs are attending the same conferences, getting the same recommendations from consultants, and hearing the same demands from investors. The push against remote work, for example, was all driven by ideology against most of the available data but it affected a huge number of jobs.
throw0101d · 6h ago
> The push against remote work, for example, was all driven by ideology against most of the available data but it affected a huge number of jobs.
And before that, open office plans.
You're saving on rent: great. But what is it doing to productivity?
Of course productivity doesn't show up on a spreadsheet, but rent does, so it's all about what "the numbers" say.
meisel · 17h ago
I believe there’s no other allocator besides jemalloc that can seamlessly override macOS malloc/free like people do with LD_PRELOAD on Linux (at least as of ~2020). jemalloc has a very nice zone-based way of making itself the default, and manages to accommodate Apple’s odd requirements for an allocator that have tripped other third-party allocators up when trying to override malloc/free.
glandium · 15h ago
Note this requires hackery that relies on Apple not changing things in its system allocator, which has happened at least twice IIRC.
adgjlsfhk1 · 17h ago
I believe mimalloc works here (but might be wrong).
schrep · 14h ago
Your work was so impactful over a long period from Firefox to Facebook. Honored to have been a small part of it.
lbrandy · 14h ago
Suppose this is as good a place to pile-on as any.
Though this was not the post I was expecting to show up today, it was super awesome for me to get to have played my tiny part in this big journey. Thanks for everything @je (and qi + david -- and all the contributors before and after my time!).
liuliu · 3h ago
Your leadership in continuing to invest in core technologies at Facebook was as fruitful as it could ever be. GraphQL, PyTorch, and React, to name a few, could not have happened without it.
kstrauser · 17h ago
I’ve wondered about this before but never when around people who might know. From my outsider view, jemalloc looked like a strict improvement over glibc’s malloc, according to all the benchmarks I’d seen when the subject came up. So, why isn’t it the default allocator?
toast0 · 13h ago
It is on FreeBSD. :P Change your malloc, change your life? May as well change your libc while you're there and use FreeBSD libc too, and that'll be easier if you also adopt the FreeBSD kernel.
I will say, the Facebook people were very excited to share jemalloc with us when they acquired my employer, but we were using FreeBSD so we already had it and thought it was normal. :)
sanxiyn · 17h ago
As far as I know there is no technical reason why jemalloc shouldn't be the default allocator. In fact, as pointed out in the article, it IS the default allocator on FreeBSD. My understanding is it is largely political.
kstrauser · 16h ago
Now that I think about it, I could easily imagine it being left out of glibc because it doesn't build on Hurd or something.
lloeki · 13h ago
> I could easily imagine it being left out of glibc because [...]
... its license is BSD-2-Clause ;)
hence "political"
vkazanov · 13h ago
Huh? BSD-style licenses are fully compatible with the GPL.
The problem is exactly this: Facebook becomes the upstream of a key part of your system.
And Facebook can just walk away from the project. Like it did just now.
lloeki · 12h ago
They are compatible but that's not the point.
If it were included it would instantly become an LGPL hard-fork because of any subsequently added line of code, if not by "virality" of the glibc license, then at least because any glibc author's code additions would be LGPL, per GNU project policy/ideology.
What prevents apple from working with gpl-style licenses is strict hatred towards code that they can't use without opensourcing it. So this is what prevents them from contributing to gpl projects: the need to control access to code.
Llvm is OK for them from this point of view: upstream is open but they can maintain and distribute their proprietary fork.
lloeki · 7h ago
> What prevents apple from working with gpl-style licenses is strict hatred towards code that they can't use without opensourcing it.
Specifically regarding the C blocks feature introduced in Snow Leopard, as I recall, Apple wrote implementations for both clang and gcc, attempted to upstream the gcc patchset, said gcc patchset was obviously under a GPL license, but the GCC team threw a fit because it wanted the code copyright to be attributed to the FSF, and that ended up as a stalemate.
If there was any hatred they could literally have skipped the whole gcc implementation + patchset upstreaming attempt altogether. Also they did have patchsets of various sizes on other projects, whose code ends up obviously being GPL as well.
The "hatred" came later with the GPLv3 family and the patent clause, which is a legal landmine, the FSF stating that signing apps is incompatible with the GPLv3, and getting hung up on copyright transfer.
> Apple's motives may not be pure, but it has published the code under the license required and it's the FSF's own copyright assignment policies that block the inclusion. The code is available and licensed appropriately for the version of GCC that Apple adopted. It might be nice if Apple took the further step of assigning copyright to the FSF, but the GPLv3 was not part of the bargain that Apple agreed to when it first started contributing to GCC.
The intent behind such copyright transfer is generally so that the recipient of the transfer can relicense without having to ask all contributors. Essentially as a contributor agreeing to a transfer means ceding up control on the license that you initially contributed under.
Read another way:
- the FSF says "this code is GPLv2"
- someone contributes under that GPLv2 promise, cedes copyright to the FSF because it's "the process"
- the FSF says "this code is now GPLv3 exclusively"
- that someone says "but that was not the deal!"
- the FSF says "I am altering the deal, pray I don't alter it any further."
Y_Y · 5h ago
Big evil FSF, always trying to extract value and increase their stock price.
lloeki · 3h ago
That was tongue-in-cheek, I thought Darth Vader's voice was enough of a cue.
License switches do happen though, and are the source of outrage. Cue redis.
The cause of transferring copyright is often practical (hard to track down + reach out to + gather answers from all authors-slash-contributors which hampers some critical decisions down the road); for the FSF it's ideological (GCC source code must remain under sole FSF control).
The consequence of the transfer though is not well understood by authors forfeiting their copyright: they essentially agree to work for free for whatever the codebase ends up being licensed to in the future, including possibly becoming entirely closed source.
Think of it next time you sign a CLA!
Y_Y · 1h ago
Apologies. I am generally against CLAs, and I think it's shitty of GNU/FSF to use them, even if they promise to only do good and free things.
favorited · 16h ago
Disclaimer: I'm not an allocator engineer, this is just an anecdote.
A while back, I had a conversation with an engineer who maintained an OS allocator, and their claim was that custom allocators tend to make one process's memory allocation faster at the expense of the rest of the system. System allocators are less able to make allocation fair holistically, because one process isn't following the same patterns as the rest.
Which is why you see it recommended so frequently with services, where there is generally one process that you want to get preferential treatment over everything else.
mort96 · 11h ago
The only way I can see that this would be true is if a custom allocator is worse about unmapping unused memory than the system allocator. After all, processes aren't sharing one heap, it's not like fragmentation in one process's address space is visible outside of that process... The only aspects of one process's memory allocation that's visible to other processes is, "that process uses N pages worth of resident memory so there's less available for me". But one of the common criticisms against glibc is that it's often really bad at unmapping its pages, so I'd think that most custom allocators are nicer to the system?
I would be interested in hearing their thoughts directly; I'm also not an allocator engineer, and someone who maintains an OS allocator probably knows wayyy more about this stuff than me. I'm sure there's some missing nuance or context which would've made it make sense.
jeffbee · 16h ago
I don't think that's really a position that can be defended. Both jemalloc and tcmalloc evolved and were refined in antagonistic multitenant environments without one overwhelming application. They are optimal for that exact thing.
lmm · 12h ago
> Both jemalloc and tcmalloc evolved and were refined in antagonistic multitenant environments without one overwhelming application. They are optimal for that exact thing.
They were mostly optimised on Facebook/Google server-side systems, which were likely one application per VM, no? (Unlike desktop usage where users want several applications to run cooperatively). Firefox is a different case but apparently mainline jemalloc never matched Firefox jemalloc, and even then it's entirely plausible that Firefox benefitted from a "selfish" allocator.
jeffbee · 12h ago
Google runs dozens to hundreds of unrelated workloads in lightweight containers on a single machine, in "borg". Facebook has a thing called "tupperware" with the same property.
favorited · 16h ago
It's possible that they were referring to something specific about their platform and its system allocator, but like I said it was an anecdote about one engineer's statement. I just remember thinking it sounded fair at the time.
vlovich123 · 15h ago
The “system” allocator is managing memory within a process boundary. The kernel is responsible for managing it across processes. Claiming that a user space allocator is greedily inefficient is voodoo reasoning that suggests the person making the claim has a poor grasp of architecture.
favorited · 14h ago
For context, the "allocator engineer" I was talking to was a kernel engineer - they have an extremely solid grasp of their platform's architecture.
The whole advantage of being the platform's system allocator is that you can have a tighter relationship between the library function and the kernel implementation.
jdsully · 14h ago
The "greedy" part is likely not releasing pages back to the OS in a timely manner.
nicoburns · 10h ago
That seems odd though, seeing as this is one of the main criticisms of glibc's allocator.
jeffbee · 7h ago
In the containerized environments where these allocators were mainly developed, it is all but totally pointless to return memory to the kernel. You might as well keep everything your container is entitled to use, because it's not like the other containers can use it. Someone or some automatic system has written down how much memory the container is going to use.
toast0 · 4h ago
Returning no longer used anonymous memory is not without benefits.
Returning pages allows them to be used for disk cache. They can be zeroed in the background by the kernel which may save time when they're needed again, or zeroing can be avoided if the kernel uses them as the destination of a full page DMA write.
Also, returning no longer used pages helps get closer to a useful memory used measurement. Measuring memory usage is pretty difficult of course, but making the numbers a little more accurate helps.
jeffbee · 14h ago
There are shared resources involved though, for example one process can cause a lot of traffic in khugepaged. However I would point out that is an endemic risk of Linux's overall architecture. Any process can cause chaos by dirtying pages, or otherwise triggering reclaim.
b0a04gl · 10h ago
jemalloc’s been battle tested in prod at scale, its license is permissive, and performance wins are known. so what exactly are we protecting by clinging to glibc malloc? ideological purity? legacy inertia? who’s actually benefiting from this status quo, and why do we still pretend it’s about “compatibility”?
o11c · 15h ago
For a long time, one of the major problems with alternate allocators is that they would never return free memory back to the OS, just keep the dirty pages in the process. This did eventually change, but it remains a strong indicator of different priorities.
There's also the fact that ... a lot of processes only ever have a single thread, or at most have a few background threads that do very little of interest. So all these "multi-threading-first allocators" aren't actually buying anything of value, and they do have a lot of overhead.
Semi-related: one thing that most people never think about: it is exactly the same amount of work for the kernel to zero a page of memory (in preparation for a future mmap) as for a userland process to zero it out (for its own internal reuse)
vlovich123 · 15h ago
That’s actually particular to alternate allocators and not true for glibc, if I recall correctly (it’s much worse at returning memory).
senderista · 13h ago
> Semi-related: one thing that most people never think about: it is exactly the same amount of work for the kernel to zero a page of memory (in preparation for a future mmap) as for a userland process to zero it out (for its own internal reuse)
Possibly more work since the kernel can't use SIMD
LtdJorge · 10h ago
Why is that? Doesn't Linux use SIMD for the crypto operations?
dwattttt · 7h ago
Allowing SIMD instructions to be used arbitrarily in kernel actually has a fair penalty to it. I'm not sure what Linux does specifically, but:
When a syscall is made, the kernel has to backup the user mode state of the thread, so it can restore it later.
If any kernel code could use SIMD registers, you'll have to backup and restore that too, and those registers get big. You could easily be looking at adding a 1kb copy to every syscall, and most of the time it wouldn't be needed.
kstrauser · 6h ago
Why is that? Couldn’t there be push_simd()/pop_simd() that the syscall itself uses around its SIMD calls?
If no syscalls use SIMD today, I’d think we’re starting from a safe position.
durrrrrrrrrrrrr · 5h ago
push_simd/pop_simd exist and are called kernel_fpu_begin/kernel_fpu_end. Their use is practically prohibited in most areas and iiuc not available on all archs, but it's available if needed.
kstrauser · 3h ago
Today I learned. Thanks!
durrrrrrrrrrrrr · 6h ago
It's not so much that you can't ever use it, it's more a you really shouldn't. It's more expensive, harder to use and rarely worth it. Main users currently are crypto and raid checksumming.
These allocators often have higher startup cost. They are designed for high performance in the steady state, but they can be worse in workloads that start a million short-lived processes in the unix style.
kstrauser · 16h ago
Oh, interesting. If that's the case, I can see why that'd be a bummer for short-lived command line tools. "Makes ls run 10x slower" would not be well received. OTOH, FreeBSD uses it by default, and it's not known for being a sluggish OS.
didip · 5h ago
Thanks for everything, JE!
jemalloc is always the first thing I installed whenever I had to provision bare servers.
If jemalloc were somehow to become the default allocator on Linux, I think it would not have a hard time retaining contributors.
Twirrim · 18h ago
Oh that's interesting. jemalloc is the memory allocator used by redis, among other projects. Wonder what the performance impact will be if they have to change allocators.
dpe82 · 18h ago
Why would they have to change? Sometimes software development is largely "done" and there isn't much more you need to do to a library.
Twirrim · 14h ago
While I certainly wish that more software would reach a "done" stage, I don't think jemalloc is necessarily there yet. Unfortunately I'm aware of there being bugs in the current version of jemalloc, when run in certain environment configurations, including memory leaks. I know the folks that found it were looking to report it, but I guess that won't happen now.
So that'll leave projects like redis & valkey with some decisions to make.
1) Keep jemalloc and accept things like memory leak bugs
2) Fork and maintain their own version of jemalloc.
3) Spend time replacing it entirely.
4) Hope someone else picks it up?
poorman · 17h ago
Jemalloc is used as an easy performance boost probably by every major Ruby on Rails server.
burnt-resistor · 8h ago
Some people believe everything must always be constantly tweaked, redone, broken and fixed, and churned for no reason. The only things that need to be fixed in mature, working software are bugs and security issues. It doesn't magically stop working or get "stale" unless dependencies, the OS, or build tools break.
Analemma_ · 17h ago
Memory allocators are something I expect to rapidly degrade in the absence of continuous updates as the world changes underneath you. Changing page sizes, new ucode latencies, new security features etc. all introduce either outright breakage or at least changing the optimum allocation strategy and making your old profiling obsolete. Not to mention the article already pointed out one instance where a software stack (KDE, in that case) used allocation profiles that broke an earlier version completely. Even though that's fixed now, any language runtime update or new feature could introduce a new allocation style that grinds you down.
As much as it's nice to think software can be done, I think something so closely tied to the kernel and hardware and the application layer, which all change constantly, never can be.
binary132 · 17h ago
“Software is just done sometimes” is a common refrain I see repeated among communities where irreplaceable software projects are often abandoned. The community consensus has a tendency to become “it is reliable and good enough, it must be done”.
jeffbee · 18h ago
For an example of why an allocator is a maintenance treadmill, consider that C++ recently (relatively) added sized delete, and Linux recently gained transparent huge pages.
Twirrim · 14h ago
It's been 14 years since THP got added to the kernel[1], surely we're past calling that "recent" :)
But if they'd declared the allocators "done" 15 years ago, then you wouldn't have it.
dymk · 18h ago
Technology marches on, and in some number of years other allocators will exist that outperform/outfeature jemalloc.
jcelerier · 18h ago
This number of years, depending on your allocation profile, could easily be something like -10 years. New allocators constantly crop up.
edflsafoiewq · 17h ago
Presumably then the performance impact of any switch will be positive.
almostgotcaught · 8h ago
> Sometimes software development is largely "done"
Lol absolutely not
swinglock · 2h ago
Last I checked Redis used their own fork of jemalloc. It may not even be updated to the latest release.
perbu · 13h ago
Back in 2008-2009 I remember the Varnish project struggled with what looked very much like a memory leak. Because of the somewhat complex way memory was used, replacing the Glibc malloc with jemalloc was an immediate improvement and removed the leak-like behavior.
technion · 13h ago
I know through years of looking at Ruby on Rails performance a commonly cited quick win was to run with jemalloc.
spookie · 17h ago
Firefox as well.
burnt-resistor · 8h ago
Lesson: Don't let one megacorp dominate or take over your FOSS project. Push back somewhat and say "no" to too much help from one source.
igrunert · 7h ago
I think the author was happy to be employed by a megacorp, along with a team to push jemalloc forward.
He and the other previous contributors are free to find new employers to continue such an arrangement, if any are willing to make that investment. Alternatively they could cobble together funding from a variety of smaller vendors. I think the author is happy to move on to other projects, after spending a long time in this problem space.
I don’t think that “don’t let one megacorp hire a team of contributors for your FOSS project” is the lesson here. I’d say it’s a lesson in working upstream - the contributions made during their Facebook / Meta investment are available for the community to build upon. They could’ve just as easily been made in a closed source fork inside Facebook, without violating the terms of the license.
Also Mozilla were unable to switch from their fork to the upstream version, and didn’t easily benefit from the Facebook / Meta investment as a result.
ecshafer · 3h ago
He worked for what looks like a decade at Facebook, I would guess at least at a Staff level. How many millions of dollars do you think he got from that? It doesn't sound like the worst trade in the world.
mavis · 16h ago
Switching to jemalloc instantly fixed an irksome memory leak in an embedded Linux appliance I inherited many moons ago. Thank you je, we salute you!
vlovich123 · 15h ago
That’s because sane allocators that aren’t glibc will return unused memory periodically to the OS while glibc prefers to permanently retain said memory.
masklinn · 14h ago
glibc will return memory to the OS just fine, the problem is that its arena design is extremely prone to fragmentation, so you end up with a bunch of arenas which are almost but not quite empty and can't be released, but can’t really be used either.
In my experience it delays it way too much, causing memory overuse and OOMs.
I have a Python program that allocates 100 GB for some work, free()s it, and then calls a subprocess that takes 100 GB as well. Because the memory use is serial, it should fit in 128 GB just fine. But it gets OOM-killed, because glibc does not turn the free() into an munmap() before the subprocess is launched, so it needs 200 GB total, with 100 GB sitting around pointlessly unused in the Python process.
This means if you use glibc, you have no idea how much memory your system will use and whether they will OOM-crash, even if your applications are carefully designed to avoid it.
I commented there 4 years ago the glibc settings MALLOC_MMAP_THRESHOLD_ and MALLOC_TRIM_THRESHOLD_ should fix that, but I was wrong: MALLOC_TRIM_THRESHOLD_ is apparently bugged and has no effect in some situations.
So in jemalloc, the settings to control this behaviour seem to actually work, in contrast to glibc malloc.
(I'm happy to be proven wrong here, but so far no combination of settings seem to actually make glibc return memory as written in their docs.)
From this perspective, it is frightening to see the jemalloc repo being archived, because that was my way to make sure stuff doesn't OOM in production all the time.
Crespyl · 14h ago
Can you elaborate on this? I don't know much about allocators.
How would the allocator know that some block is unused, short of `free` being called? Does glibc not return all memory after a `free`? Do other allocators do something clever to automatically release things? Is there just a lot of bookkeeping overhead that some allocators are better at handling?
mort96 · 11h ago
They're not really correct, glibc will return stuff back to the OS. It just has some quirks about how and when it does it.
First, some background: no allocator will return memory back to the kernel for every `free`. That's for performance and memory consumption reasons: the smallest unit of memory you can request from and return to the kernel is a page (typically 4kiB or 16kiB), and requesting and returning memory (typically called "mapping" and "unmapping" memory in the UNIX world) has some performance overhead.
So if you allocate space for one 32-byte object for example, your `malloc` implementation won't map a whole new 4k or 16k page to store 32 bytes. The allocator probably has some pages from earlier allocations, and it will make space for your 32-byte allocation in pages it has already mapped. Or it can't fit your allocation, so it will map more pages, and then set aside 32 bytes for your allocation.
This all means that when you call `free()` on a pointer, the allocator can't just unmap a page immediately, because there may be other allocations on the same page which haven't been freed yet. Only when all of the allocations which happen to be on a specific page are freed, can the page be unmapped. In a worst-case situation, you could in theory allocate and free memory in such a way that you end up with 100 1-byte allocations allocated across 100 pages, none of which can be unmapped; you'd be using 400kiB or 1600kiB of memory to store 100 bytes. (But that's not necessarily a huge problem, because it just means that future allocations would probably end up in the existing pages and not increase your memory consumption.)
Now, the glibc-specific quirk: glibc will only ever unmap the last page, from what I understand. So you can allocate megabytes upon megabytes of data, which causes glibc to map a bunch of pages, then free() every allocation except for the last one, and you'd end up still consuming many megabytes of memory. Glibc won't unmap those megabytes of unused pages until you free the allocation that sits in the last page that glibc mapped.
This typically isn't a huge deal; yes, you're keeping more memory mapped than you strictly need, but if the application needs more memory in the future, it'll just re-use the free space in all the pages it has already mapped. So it's not like those pages are "leaked", they're just kept around for future use.
It can sometimes be a real problem though. For example, a program could do a bunch of memory-intensive computation on launch requiring gigabytes of memory at once, then all that computation culminates in one relatively small allocated object, then the program calls free() on all the allocations it did as part of that computation. The application could potentially keep around gigabytes worth of pages which serve no purpose but can't be unmapped due to that last small allocation.
If any of this is wrong, I would love to be corrected. This is my current impression of the issue but I'm not an authoritative source.
adwn · 11h ago
When `free()` is called, the allocator internally marks that specific memory area as unused, but it doesn't necessarily return that area back to the OS, for two main reasons:
1. `malloc()` is usually called with sizes smaller than the sizes by which the allocator requests memory from the OS, which are at least page-sized (4096 bytes on x86/x86-64) and often much larger. After a `free()`, the freed memory can't be returned to the OS because it's only a small chunk in a larger OS allocation. Only after all memory within a page has been `free()`d, the allocator may, but doesn't have to, return that page back to the OS.
2. After a `free()`, the allocator wants to hang on to that memory area because the next `malloc()` is sure to follow soon.
This is a very simplified overview, and different allocators have different strategies for gathering new `malloc()`s in various areas and for returning areas back to the OS (or not).
gdiamos · 18h ago
Congrats on the great run and the future. Jemalloc was an inspiration to many memory allocators.
kstrauser · 16h ago
I was using FreeBSD back when jemalloc came along, and it blew my mind to imagine swapping out just that one (major) part of its libc. Honestly, it hadn't occurred to me, and it made me wonder what else we could wholesale replace.
the_mitsuhiko · 10h ago
All the allocators have the same issue. They largely work against a shared set of allocation APIs. Many of their users mostly engage via malloc and free.
So the flow is like this: user has an allocation looking issue. Picks up $allocator. If they have an $allocator type problem then they keep using it, otherwise they use something else.
There are tons of users of these allocators, but many rarely engage with the developers. Many wouldn’t even notice improvements or regressions on upgrades, because after the initial choice they stop looking.
I’m not sure how to fix that, but this is not healthy for such projects.
Cloudef · 8h ago
malloc is a bad API in general; if you want to go fast, you don't rely on a general-purpose allocator
const_cast · 4h ago
This is true, but the unfortunate thing with how C and C++ were developed is that pretty much everything just assumes the existence of malloc/free. So if you’re using third-party libraries then it’s out of your control mostly. Linking a new allocator is a very easy and pretty much free way to improve performance.
p0w3n3d · 15h ago
Thank you. Jemalloc was recently recommended to me in a presentation about Java optimization.
I wonder if you did get everything you should from the companies that use it. I mean sometimes I feel that big tech firms only use free software, never giving anything to it, so I hope you were the exception here.
jeffbee · 6h ago
Imagine being a Java developer and thinking "what have big tech corporations ever done for me?"
keybored · 4h ago
That are good for me, the developer.
mrweasel · 8h ago
Looking at all the comments and lightly browsing the source code, I'm amazed. Both at how much impact a memory allocator can make, but also how much code is involved.
I'm not really sure what I expected, but somehow I expect a memory allocator to be ... smaller, simpler perhaps?
ratorx · 8h ago
Memory allocators can be simple. In fact it was an assignment for a course in the 2nd year of my CS degree to make an (almost) complete allocator.
However it is typically always more complex to make production quality software, especially in a performance sensitive domain.
burnt-resistor · 8h ago
Naive allocators are very easy: just subdivide RAM and defragment only when absolutely necessary (if virtual memory is unavailable). Performant allocators are hard.
I think we lost a great deal of potential when ORCA was too tied to Pony and not extracted to a framework, tool, and/or library useful outside of it such as integrated or working with LLVM.
swinglock · 1h ago
mimalloc is cleaner but lacks the very useful profiling features. To be fair it also has not gone through decades of changes as described in the postmortem either.
const_cast · 4h ago
It’s the same way with garbage collectors.
You can write a naive mark-and-sweep in an afternoon. You can write a reference counter in even less time. And for some runtimes this is fine.
But writing a generational, concurrent, moving GC takes a lot of time. But if you can achieve it, you can get amazing performance gains. Just look at recent versions of Java.
nevon · 17h ago
I very recently used jemalloc to resolve a memory fragmentation issue that caused a service to OOM every few days. While jemalloc as it is will continue to work, same as it does today, I wonder what allocator I should reach for in the future. Does anyone have any experiences to share regarding tcmalloc or other allocators that aim to perform better than stock glibc?
beyonddream · 16h ago
Try mimalloc. I prototyped a feature on top of mimalloc, and while the effort was a dead end, the code (this was around 2020) was nicely written and well maintained, and it was fun to hack on. When I swapped jemalloc in our system with mimalloc, it was on par if not better in terms of fragmentation growth control and heap usage.
sanxiyn · 17h ago
mimalloc is a good choice. CPython recently switched to mimalloc.
kev009 · 17h ago
snmalloc
poorman · 17h ago
How cool would it be to see Doug Lea pick up the torch and create a modern day multi-threaded dlmalloc2!?
ecshafer · 17h ago
dl is just an observer on the open jdk governance board now, so he might have enough time.
b0a04gl · 12h ago
been using jemalloc unknowingly for a long time. only after reading this post it hit how much of it was under the hood in things I’ve built. didn’t know the gc-style decay mechanism was that involved, or that it handled fragmentation with time-based heuristics. surprising how much tuning was exposed through env vars. solid closure
jeffbee · 18h ago
The article mentioned the influence of large-scale profiling on both jemalloc and tcmalloc, but doesn't mention mimalloc. I consider mimalloc to be on par with these others, and now I am wondering whether Microsoft also used large scale profiling to develop theirs, or if they just did it by dead reckoning.
bch · 17h ago
How does mimalloc telemetry compare to jemalloc?
dikei · 17h ago
I still remember the day when I used jemalloc's debug features to triage and resolve some nasty memory bloat issues in our code that uses RocksDB.
Good times.
brcmthrowaway · 12h ago
What allocator does Apple use?
forty · 10h ago
Probably iMalloc ;)
half-kh-hacker · 10h ago
you probably want to look at their 'libmalloc'
skeptrune · 17h ago
Kind of nuts that he worked on Jemalloc for over a decade while having a personal preference for garbage collection. I'm surprised he doesn't have more regret.
kstrauser · 16h ago
Why are those two mutually exclusive? I'd think that a high performance allocator would be especially crucial in the implementation of a fast garbage collected language. For example, in Python you can't alloc(n * sizeof(obj)) to reserve that much contiguous space for n objects. Instead, you use the builtins which isolate you from that low-level bookkeeping. Those builtins have to be pretty fast or performance would be terrible.
fermentation · 14h ago
A job is a job
userbinator · 16h ago
A bad choice of title, as "postmortem" made me think there was some severe outage caused by jemalloc.
stingraycharles · 16h ago
I think this implies your understanding of the term “post-mortem” is incorrect, rather than the title.
drysine · 11h ago
Or maybe not
chrisweekly · 16h ago
Well, that's not the only meaning of "postmortem". The fine article does open with,
"The jemalloc memory allocator was first conceived in early 2004, and has been in public use for about 20 years now. Thanks to the nature of open source software licensing, jemalloc will remain publicly available indefinitely. But active upstream development has come to an end. This post briefly describes jemalloc’s development phases, each with some success/failure highlights, followed by some retrospective commentary."
runevault · 16h ago
postmortem is looking back after an event. That can be a security event/outage, it can also be the completion of a project (see: game studios often do postmortems once their game is out to look back on what went wrong and right between preproduction, production, and post launch).
gilgoomesh · 15h ago
It's weird that we use "postmortem" in those cases since the word literally means "after death"; kind of implying something bad happened. I get that most of these postmortems are done after major development ceases, so it kind of is "dead" but still.
Surely a "retrospective" would be a better word for a look back. It even means "look back".
simonask · 12h ago
It gets even better. Some companies use "mid-mortems", which are evaluation and reflection processes in the middle of a project...
meepmorp · 6h ago
sounds like an appropriate way to talk about death march projects, tbh
bmacho · 8h ago
The last part is unfortunate. However, it is a perfectly fine choice of title, as it does not make the majority of us think that there was an outage caused by jemalloc. You should update how you think of the word, and align it with the majority usage.
A fellow traveler, ahoy!
SGI's decision to build out Itanium systems may have helped precipitate their own downfall. That was sad.
What's hard about using TCMalloc if you're not using bazel? (Not asking to imply that it's not, but because I'm genuinely curious.)
1. Use it as a dynamically linked library. This is not great because you’re taking at a minimum the performance hit of going through the PLT for every call. The forfeited performance is even larger if you compare against statically linking with LTO (i.e. so that you can inline calls to malloc, get the benefit of FDO, etc.). Not to mention all the deployment headaches associated with shared libraries.
2. Painfully manually create a static library. I’ve done this, it’s awful; especially if you want to go the extra mile to capture as much performance as possible and at least get partial LTO (i.e. of TCMalloc independent of your application code, compiling all of TCMalloc’s compilation units together to create a single object file).
When I was at Meta I imported TCMalloc to benchmark against (to highlight areas where we could do better in Jemalloc) by painstakingly hand-translating its bazel BUILD files to buck2, because there was legitimately no better option.
As a consequence of being so hard to use outside of Google, TCMalloc has many more unexpected (sometimes problematic) behaviors than Jemalloc when used as a general purpose allocator in other environments (e.g. it basically assumes that you are using a certain set of Linux configuration options [1] and behaves rather poorly if you’re not).
[1] https://google.github.io/tcmalloc/tuning.html#system-level-o...
As I observed when I was at Google: tcmalloc wasn't a dedicated team but a project driven by server performance optimization engineers aiming to improve performance of important internal servers. Extracting it to github.com/google/tcmalloc was complex due to intricate dependencies (https://abseil.io/blog/20200212-tcmalloc ). As internal performance priorities demanded more focus, less time was available for maintaining the CMake build system. Maintaining the repo could at best be described as a community contribution activity.
> Meta’s needs stopped aligning well with those of external uses some time ago, and they are better off doing their own thing.
I think Google's needs diverged from external uses even longer ago :) (For a long time the google3 and gperftools tcmalloc implementations were very different.)
And if you try to build their libraries from source, that involves downloading tens of gigabytes of sysroots and toolchains and vendored dependencies.
Oh and you probably don't want multiple versions of a library in your binary, so be prepared to use Google's (probably outdated) version of whatever libraries they vendor.
And they make no effort whatsoever to distinguish between public header files and their source code, so if you wanna package up their libraries, be prepared to make scripts to extract the headers you need (including headers from vendored dependencies); you can't just copy all of some 'include/' folder.
And their public headers tend to do idiotic stuff like `#include "base/pc.h"`, where that `"base/pc.h"` path is not relative to the file doing the include. So you're gonna have to pollute the include namespace. Make sure not to step on their toes! There's a lot of them.
I have had the misfortune of working with Abseil, their WebRTC library, their gRPC library and their protobuf library, and it's all terrible. For personal projects where I don't have a very, very good reason to use Google code, I try to avoid it like the plague. For professional projects where I've had to use libwebrtc, the only reasonable approach is to silo off libwebrtc into its own binary which only deals with WebRTC, typically with a line-delimited JSON protocol on stdin/stdout. For things like protobuf/gRPC where that hasn't been possible, you just have to live with the suffering.
..This comment should probably have been a blog post.
And I build all Rust crates (including rustc) with -O3, same as C/C++.
Well, I remember one - very biased - example where I had a look at a class that was especially expensive to compile, like 40 seconds (on a Ryzen 7950X) and maybe 2 GB of RAM. It had under 200 LOC and didn't seem to do anything that's typically expensive to compile... except for the stuff it included. Which also didn't seem to do anything fancy. But transitive includes can snowball if you don't add any "compile firewalls".
That's another matter - just because forward-declares are allowed, doesn't mean they are mandated, but in my experience the reviewers were paying attention to that pretty well.
Counter-examples to "~all includes": https://source.chromium.org/chromium/chromium/src/+/main:thi..., https://source.chromium.org/chromium/chromium/src/+/main:thi..., https://source.chromium.org/chromium/chromium/src/+/main:thi....
I picked couple random headers from the directory where I've contributed the most to blink, and from what I'm seeing, most of the classes that could be forward-declared, were. I have not looked at .cc files given that those tend to need to see the declaration (except when it's unused, but then why have a forward-decl at all?) or the compiler would complain about access into incomplete type.
> Well, I remember one - very biased - example where I had a look at a class that was especially expensive to compile, like 40 seconds (on a Ryzen 7950X) and maybe 2 GB of RAM. It had under 200 LOC and didn't seem to do anything that's typically expensive to compile... except for the stuff it included.
Maybe the stuff was actually being compiled because of some member in a class (so it was actually expensive to compile). Or maybe you stumbled upon a place where folks weren't paying attention. Hard to say without a concrete example. The "compile firewall" was added pretty recently I think, but I don't know if it's going to block anything from landing.
Edit: formatting (switched bulleted list into comma-separated because clearly I don't know how to format it).
And the include graph analysis: https://commondatastorage.googleapis.com/chromium-browser-cl...
The annotated red dots correspond to the last time Chrome developers did a big push to prune the include graph to optimize build time. It was effective, but there was push back. C++ developers just want magic, they don't want to think about dependency management, and it's hard to blame them. But, at the end of the day, builds scale with sources times dependencies, and if you aren't disciplined, you can expect superlinear build times.
110 CPU hours for a build. (Fortunately, it seems to be a little over half that for my CPU. "Cloud CPUs" are kinda slow.)
I picked the 5001st largest file with includes. It's zoom_view_controller.cc, 140 lines in the .cc file, size with includes: 19.5 MB.
Initially I picked the 5000th largest file with includes, but for devtools_target_ui.cc, I see a bit more legitimacy for having lots of includes. It has 384 "own" lines in the .cc file and, of course, also about 19.5 MB size with includes.
A C++20 source file including some standard library headers easily bloats to a little under 1 MB IIRC, and that's already kind of unreasonable. 20x of that is very unreasonable.
I don't think that I need to tell anyone on the Chrome team how to improve performance in software: you measure and then you tackle the dumb low-hanging fruit first. From these results, it doesn't seem like anyone is working with the actual goal to improve the situation as long as the guidelines are followed on paper.
> Oh and you probably don't want multiple versions of a library in your binary, so be prepared to use Google's (probably outdated) version of whatever libraries they vendor.
This is the only complaint I can relate to. Sometimes they lag on rolling dependencies forward. Not so infrequently there are minor (or not so minor) issues when I try to do so myself and I don't want to waste time patching my dependencies up so I get stuck for a while until they get around to it. That said, usually rolling forward works without issue.
> if you try to build their libraries from source, that involves downloading tens of gigabytes of sysroots and toolchains and vendored dependencies.
Out of curiosity which project did you run into this with? That said, isn't the only alternative for them moving to something like nix? Otherwise how do you tightly specify the build environment?
> Out of curiosity which project did you run into this with?
Their WebRTC library for the most part, but also the gRPC C++ library. Unlike WebRTC, grpc++ is in most package managers so the need to build it myself is less, but WebRTC is a behemoth and not in any package manager.
> That said, isn't the only alternative for them moving to something like nix? Otherwise how do you tightly specify the build environment?
I don't expect my libraries to tightly specify the build environment. I expect my libraries to conform to my software's build environment, to use versions of other libraries that I provide to it, etc etc. I don't mind that Google builds their application software the way they do, Google Chrome should tightly constrain its build environment if Google wants; but their libraries should fit in to my environment.
I'm wondering, what is your relationship with Google software that you build from source? Are you building their libraries to integrate with your own applications, or do you just build Google's applications from source and use them as-is?
If you are building a single Google project they are easy to get up and running. If you are building your own project on top of theirs, things get difficult. Those library issues will get you.
I don't know about OP, but we have our own in house package manager. If Conan was ready a couple years sooner we would have used that instead.
Upside is (I guess) if I ever want to use grpc in another project the work's already done and it'll just be a matter of copy/paste.
The counterexample is the language Go. The team running Go has put considerable care and attention into making this project welcoming for developers to contribute, while still adhering to Google code contribution requirements. Building from source is straightforward and iirc it’s one of the easier cross compilers to set up.
Install docs: https://go.dev/doc/install/source#bootstrapFromBinaryRelease
Or rather it was the last time I tried.
They did, in a different way. The world is used to distinguishing them by convention, putting them in different directory hierarchies (src/, include/). google3 depends on the build system to do so: "which header file is public" is documented in BUILD files. You are then required to use their build system to grasp the difference :(
> And their public headers tend to do idiotic stuff like `#include "base/pc.h"`, where that `"base/pc.h"` path is not relative to the file doing the include.
I have to disagree on this one. Relying on relative include paths suck. Just having one `-I/project/root` is the way to go.
Oh to be clear, I'm not saying that they should've used relative includes. I'm complaining that they don't put their includes in their own namespace. If public headers were in a folder called `include/webrtc` as is the typical convention, and they all contained `#include <webrtc/base/pc.h>` or `#include "webrtc/base/pc.h"` I would've had no problem. But as it is, WebRTC's headers are in include paths which it's really difficult to avoid colliding with. You'll cause collisions if your project has a source directory called `api`, or `pc`, or `net`, or `media`, or a whole host of other common names.
Now I'm curious why grpc, webrtc and some other Chromium repos were set up like this. Google projects which started in google3 and were later exported as open source projects don't have this defect, for example tensorflow, abseil etc. They all had a top-level directory containing all their code, so it becomes `#include "tensorflow/...`.
Feels like a weird collision of coding style and starting a project outside of their monorepo
> I have to disagree on this one.
The double-quotes literally mean "this dependency is relative to the current file". If you want to depend on a -I, then signal that by using angle brackets.
I would be surprised if I read some project's code where angle brackets are used to include headers from within the same project. I'm not surprised when quotes are used to include code from within the project but relative to the project's root.
Thanks again. This is far outside my regular work, but it fascinates me.
(I managed to avoid infecting my project with boringSSL)
The reverse should be much easier, which was the point of the post. Pointing it out as a capability (translation of build systems) that is handled well, is, well, informative. The future isn’t evenly distributed and people aren’t always aware of capabilities, even on HN
Why not? I mean, this is a complete drive-by comment, so please correct me, but there was a fully staffed team at Meta that maintained it, yet it was not in the best place to manage the issues?
custom-malloc-newbie question: Why is the choice of build system (generator) significant when evaluating the usability of a library?
I've done both before, and seen libraries at various levels of complexity; there is definitely a point where you just want to give up and not use the thing when it's very complex.
One fine day, we discovered Jemalloc and put it in our service, which was causing a lot of memory fragmentation. We did not think that those 2 lines of changes in Dockerfile were going to fix all of our woes, but we were pleasantly surprised. Every single issue went away.
Today, our multi-million dollar revenue company is using your memory allocator on every single service and on every single Dockerfile.
Thank you! From the bottom of our hearts!
the top 3 from https://github.com/topics/resize-images (as of 2025-06-13)
imaginary: https://github.com/h2non/imaginary/blob/1d4e251cfcd58ea66f83...
imgproxy: https://web.archive.org/web/20210412004544/https://docs.imgp... (linked from a discussion in the imaginary repo)
imagor: https://github.com/cshum/imagor/blob/f6673fa6656ee8ef17728f2...
https://github.com/libvips/libvips/discussions/3019
FWIW while it was a factor it was just one of a number: https://github.com/rust-lang/rust/issues/36963#issuecomment-...
And jemalloc was only removed two years after that issue was opened: https://github.com/rust-lang/rust/pull/55238
I wonder if some kind of dynamic page-size (with dynamic ftrace-style binary patching for performance?) would have been that much slower.
I learned of it from its integration in FreeBSD and never looked back.
jemalloc has helped entertain a lot of people :)
windows def allocator is pos. Jemalloc rules
Which one of them? These days it could mean HeapAlloc, or it could mean malloc from uCRT.
Wow, still? I remember allocator benchmarks from 10-15 years ago where there were some notable differences between allocators... and then Windows with like 20% the performance of everything else!
Or I wonder if they could simply use tcmalloc or another allocator these days?
Facebook infrastructure engineering reduced investment in core technology, instead emphasizing return on investment.
> Or I wonder if they could simply use tcmalloc or another allocator these days?
Jemalloc is very deeply integrated there, so this is a lot harder than it sounds. From the telemetry being plumbed through in Strobelight, to applications using every highly Jemalloc-specific extension under the sun (e.g. manually created arenas with custom extent hooks), to the convergent evolution of applications being written in ways such that they perform optimally with respect to Jemalloc’s exact behavior.
https://github.com/facebook/jemalloc
> as a result of recent changes within Meta we no longer have anyone shepherding long-term jemalloc development with an eye toward general utility
> we reached a sad end for jemalloc in the hands of Facebook/Meta
> Meta’s needs stopped aligning well with those of external uses some time ago, and they are better off doing their own thing.
How can I find out if Facebook's focus is aligned with my own needs?
> And people find themselves in impossible situations where the main choices are 1) make poor decisions under extreme pressure, 2) comply under extreme pressure, or 3) get routed around.
It doesn't sound like a work place :-(
- fsociety
And before that, open office plans.
You're saving on rent: great. But what is it doing to productivity?
* https://business.adobe.com/blog/perspectives/what-science-sa...
Of course productivity doesn't show up on a spreadsheet, but rent does, so that's what "the numbers" say.
Though this was not the post I was expecting to show up today, it was super awesome for me to get to have played my tiny part in this big journey. Thanks for everything @je (and qi + david -- and all the contributors before and after my time!).
I will say, the Facebook people were very excited to share jemalloc with us when they acquired my employer, but we were using FreeBSD so we already had it and thought it was normal. :)
... its license is BSD-2-Clause ;)
hence "political"
The problem is exactly this: Facebook becomes the upstream of a key part of your system.
And Facebook can just walk away from the project. Like it did just now.
If it were included it would instantly become an LGPL hard-fork because of any subsequently added line of code, if not by "virality" of the glibc license, then at least because any glibc author code addition would be LGPL, per GNU project policy/ideology.
Also, this would be a hard bar to pass: https://sourceware.org/glibc/wiki/CopyrightFSForDisclaim
As I recall this is what prevented Apple from contributing C blocks† back to upstream GCC.
† https://github.com/lloeki/cblocks-clobj
Llvm is OK for them from this point of view: upstream is open but they can maintain and distribute their proprietary fork.
Specifically regarding the C blocks feature introduced in Snow Leopard, as I recall, Apple wrote implementations for both clang and gcc, attempted to upstream the gcc patchset, said gcc patchset was obviously under a GPL license, but the GCC team threw a fit because it wanted the code copyright to be attributed to the FSF, and that ended up as a stalemate.
If there was any hatred they could literally have skipped the whole gcc implementation + patchset upstreaming attempt altogether. Also they did have patchsets of various sizes on other projects, whose code ends up obviously being GPL as well.
The "hatred" came later with the GPLv3 family and the patent clause, which is a legal landmine, the FSF stating that signing apps is incompatible with the GPLv3, and getting hung up on copyright transfer.
From https://lwn.net/Articles/405417/
> Apple's motives may not be pure, but it has published the code under the license required and it's the FSF's own copyright assignment policies that block the inclusion. The code is available and licensed appropriately for the version of GCC that Apple adopted. It might be nice if Apple took the further step of assigning copyright to the FSF, but the GPLv3 was not part of the bargain that Apple agreed to when it first started contributing to GCC.
The intent behind such copyright transfer is generally so that the recipient of the transfer can relicense without having to ask all contributors. Essentially as a contributor agreeing to a transfer means ceding up control on the license that you initially contributed under.
Read another way:
- the FSF says "this code is GPLv2"
- someone contributes under that GPLv2 promise, cedes copyright to the FSF because it's "the process"
- the FSF says "this code is now GPLv3 exclusively"
- that someone says "but that was not the deal!"
- the FSF says "I am altering the deal, pray I don't alter it any further."
License switches do happen though, and are the source of outrage. Cue redis.
The cause of transferring copyright is often practical (hard to track down + reach out to + gather answers from all authors-slash-contributors which hampers some critical decisions down the road); for the FSF it's ideological (GCC source code must remain under sole FSF control).
The consequence of the transfer though is not well understood by authors forfeiting their copyright: they essentially agree to work for free for whatever the codebase ends up being licensed to in the future, including possibly becoming entirely closed source.
Think of it next time you sign a CLA!
A while back, I had a conversation with an engineer who maintained an OS allocator, and their claim was that custom allocators tend to make one process's memory allocation faster at the expense of the rest of the system. System allocators are less able to make allocation fair holistically, because one process isn't following the same patterns as the rest.
Which is why you see it recommended so frequently with services, where there is generally one process that you want to get preferential treatment over everything else.
I would be interested in hearing their thoughts directly. I'm not an allocator engineer either, and someone who maintains an OS allocator probably knows wayyy more about this stuff than me. I'm sure there's some missing nuance or context which would've made it make sense.
They were mostly optimised on Facebook/Google server-side systems, which were likely one application per VM, no? (Unlike desktop usage where users want several applications to run cooperatively). Firefox is a different case but apparently mainline jemalloc never matched Firefox jemalloc, and even then it's entirely plausible that Firefox benefitted from a "selfish" allocator.
The whole advantage of being the platform's system allocator is that you can have a tighter relationship between the library function and the kernel implementation.
Returning pages allows them to be used for disk cache. They can be zeroed in the background by the kernel which may save time when they're needed again, or zeroing can be avoided if the kernel uses them as the destination of a full page DMA write.
Also, returning no longer used pages helps get closer to a useful memory used measurement. Measuring memory usage is pretty difficult of course, but making the numbers a little more accurate helps.
There's also the fact that ... a lot of processes only ever have a single thread, or at most have a few background threads that do very little of interest. So all these "multi-threading-first allocators" aren't actually buying anything of value, and they do have a lot of overhead.
Semi-related: one thing that most people never think about is that it is exactly the same amount of work for the kernel to zero a page of memory (in preparation for a future mmap) as for a userland process to zero it out (for its own internal reuse).
Possibly more work since the kernel can't use SIMD
When a syscall is made, the kernel has to backup the user mode state of the thread, so it can restore it later.
If any kernel code could use SIMD registers, you'll have to backup and restore that too, and those registers get big. You could easily be looking at adding a 1kb copy to every syscall, and most of the time it wouldn't be needed.
If no syscalls use SIMD today, I’d think we’re starting from a safe position.
https://www.kernel.org/doc/html/next/core-api/floating-point...
jemalloc is always the first thing I installed whenever I had to provision bare servers.
If jemalloc were somehow the default allocator on Linux, I think it would not have a hard time retaining contributors.
Even from a quick look at the open issues, I can see https://github.com/jemalloc/jemalloc/issues/2838, and https://github.com/jemalloc/jemalloc/issues/2815 as two examples, but there's a fair number of issues still open against the repository.
So that'll leave projects like redis & valkey with some decisions to make.
1) Keep jemalloc and accept things like memory leak bugs
2) Fork and maintain their own version of jemalloc.
3) Spend time replacing it entirely.
4) Hope someone else picks it up?
As much as it's nice to think software can be done, I think something so closely tied to the kernel and hardware and the application layer, which all change constantly, never can be.
https://www.kernelconfig.io/config_transparent_hugepage
Lol absolutely not
He and the other previous contributors are free to find new employers to continue such an arrangement, if any are willing to make that investment. Alternatively they could cobble together funding from a variety of smaller vendors. I think the author is happy to move on to other projects, after spending a long time in this problem space.
I don’t think that “don’t let one megacorp hire a team of contributors for your FOSS project” is the lesson here. I’d say it’s a lesson in working upstream - the contributions made during their Facebook / Meta investment are available for the community to build upon. They could’ve just as easily been made in a closed source fork inside Facebook, without violating the terms of the license.
Also Mozilla were unable to switch from their fork to the upstream version, and didn’t easily benefit from the Facebook / Meta investment as a result.
In fact, Jason himself (the author of jemalloc and TFA) posted an article on glibc malloc fragmentation 15 years ago: https://web.archive.org/web/20160417080412/http://www.canonw...
And it's an issue to this day: https://blog.arkey.fr/drafts/2021/01/22/native-memory-fragme...
In my experience it delays it way too much, causing memory overuse and OOMs.
I have a Python program that allocates 100 GB for some work, free()s it, and then calls a subprocess that takes 100 GB as well. Because the memory use is serial, it should fit in 128 GB just fine. But it gets OOM-killed, because glibc does not turn the free() into an munmap() before the subprocess is launched, so it needs 200 GB total, with 100 GB sitting around pointlessly unused in the Python process.
This means if you use glibc, you have no idea how much memory your system will use and whether they will OOM-crash, even if your applications are carefully designed to avoid it.
Similar experience: https://news.ycombinator.com/item?id=24242571
I commented there 4 years ago that the glibc settings MALLOC_MMAP_THRESHOLD_ and MALLOC_TRIM_THRESHOLD_ should fix that, but I was wrong: MALLOC_TRIM_THRESHOLD_ is apparently bugged and has no effect in some situations.
A bug I think might be involved: "free() doesn't honor M_TRIM_THRESHOLD" https://sourceware.org/bugzilla/show_bug.cgi?id=14827
Open since 13 years ago. This stuff doesn't seem to get fixed.
The fix in general is to use jemalloc with
which tells it to immediately munmap() at free(). So in jemalloc, the settings to control this behaviour seem to actually work, in contrast to glibc malloc.
(I'm happy to be proven wrong here, but so far no combination of settings seem to actually make glibc return memory as written in their docs.)
From this perspective, it is frightening to see the jemalloc repo being archived, because that was my way to make sure stuff doesn't OOM in production all the time.
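For reference, the jemalloc settings elided above are presumably its decay options; a hypothetical reconstruction (option names from jemalloc's `opt.dirty_decay_ms` / `opt.muzzy_decay_ms` documentation — the exact values used are not shown in the comment):

```shell
# Purge dirty and muzzy pages immediately, so free()d memory goes
# back to the kernel right away rather than decaying over time:
export MALLOC_CONF="dirty_decay_ms:0,muzzy_decay_ms:0"
```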
How would the allocator know that some block is unused, short of `free` being called? Does glibc not return all memory after a `free`? Do other allocators do something clever to automatically release things? Is there just a lot of bookkeeping overhead that some allocators are better at handling?
First, some background: no allocator will return memory back to the kernel for every `free`. That's for performance and memory consumption reasons: the smallest unit of memory you can request from and return to the kernel is a page (typically 4kiB or 16kiB), and requesting and returning memory (typically called "mapping" and "unmapping" memory in the UNIX world) has some performance overhead.
So if you allocate space for one 32-byte object for example, your `malloc` implementation won't map a whole new 4k or 16k page to store 32 bytes. The allocator probably has some pages from earlier allocations, and it will make space for your 32-byte allocation in pages it has already mapped. Or it can't fit your allocation, so it will map more pages, and then set aside 32 bytes for your allocation.
This all means that when you call `free()` on a pointer, the allocator can't just unmap a page immediately, because there may be other allocations on the same page which haven't been freed yet. Only when all of the allocations which happen to be on a specific page are freed, can the page be unmapped. In a worst-case situation, you could in theory allocate and free memory in such a way that you end up with 100 1-byte allocations allocated across 100 pages, none of which can be unmapped; you'd be using 400kiB or 1600kiB of memory to store 100 bytes. (But that's not necessarily a huge problem, because it just means that future allocations would probably end up in the existing pages and not increase your memory consumption.)
Now, the glibc-specific quirk: glibc will only ever unmap the last page, from what I understand. So you can allocate megabytes upon megabytes of data, which causes glibc to map a bunch of pages, then free() every allocation except for the last one, and you'd end up still consuming many megabytes of memory. Glibc won't unmap those megabytes of unused pages until you free the allocation that sits in the last page that glibc mapped.
This typically isn't a huge deal; yes, you're keeping more memory mapped than you strictly need, but if the application needs more memory in the future, it'll just re-use the free space in all the pages it has already mapped. So it's not like those pages are "leaked", they're just kept around for future use.
It can sometimes be a real problem though. For example, a program could do a bunch of memory-intensive computation on launch requiring gigabytes of memory at once, then all that computation culminates in one relatively small allocated object, then the program calls free() on all the allocations it did as part of that computation. The application could potentially keep around gigabytes worth of pages which serve no purpose but can't be unmapped due to that last small allocation.
If any of this is wrong, I would love to be corrected. This is my current impression of the issue but I'm not an authoritative source.
1. `malloc()` is usually called with sizes smaller than the sizes by which the allocator requests memory from the OS, which are at least page-sized (4096 bytes on x86/x86-64) and often much larger. After a `free()`, the freed memory can't be returned to the OS because it's only a small chunk within a larger OS allocation. Only after all memory within a page has been `free()`d may the allocator, though it doesn't have to, return that page to the OS.
2. After a `free()`, the allocator wants to hang on to that memory area because the next `malloc()` is sure to follow soon.
This is a very simplified overview, and different allocators have different strategies for gathering new `malloc()`s in various areas and for returning areas back to the OS (or not).
So the flow is like this: a user has an allocation-related issue and picks up $allocator. If their problem is the kind $allocator solves, they keep using it; otherwise they try something else.
There are tons of users of these allocators, but many rarely engage with the developers. Many wouldn't even notice improvements or regressions on upgrades, because after the initial choice they stop looking.
I’m not sure how to fix that, but this is not healthy for such projects.
I wonder whether you got everything you should have from the companies that use it. Sometimes I feel that big tech firms only use free software without ever giving anything back, so I hope you were the exception here.
I'm not really sure what I expected, but somehow I expect a memory allocator to be ... smaller, simpler perhaps?
However, production-quality software is almost always more complex than you'd expect, especially in a performance-sensitive domain.
I think we lost a great deal of potential when ORCA stayed tightly tied to Pony instead of being extracted into a framework, tool, or library usable outside of it, e.g. something integrated with or working alongside LLVM.
You can write a naive mark-and-sweep in an afternoon. You can write a reference counter in even less time. And for some runtimes this is fine.
But writing a generational, concurrent, moving GC takes a lot of time. If you can achieve it, though, you can get amazing performance gains; just look at recent versions of Java.
Good times.
"The jemalloc memory allocator was first conceived in early 2004, and has been in public use for about 20 years now. Thanks to the nature of open source software licensing, jemalloc will remain publicly available indefinitely. But active upstream development has come to an end. This post briefly describes jemalloc’s development phases, each with some success/failure highlights, followed by some retrospective commentary."
Surely "retrospective" would be a better word for a look back. It even means "look back".