How fast can the RPython GC allocate?

32 points · todsacerdoti · 6/15/2025, 7:55:09 PM · pypy.org ↗

Comments (8)

kragen · 6h ago
My summary is that it's about one or two allocations per nanosecond on CF Bolz's machine, an AMD Ryzen 7 PRO 7840U, presumably on one core, and it's about 11 instructions per allocation.
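
(For scale: assuming the core was running somewhere between the 7840U's 3.3 GHz base and 5.1 GHz boost clock, 2.1 cycles per allocation works out to roughly 0.4–0.6 ns, which is where the one-to-two-per-nanosecond figure comes from.)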

This is about 2–4× faster than my pointer-bumping arena allocator for C, kmregion†, which is a similar number of instructions on the (inlined) fast path. But possibly that's because I was testing on slower hardware. I was also testing with 16-byte initialized objects, but without a GC. It's about 10× the speed of malloc/free.

I don't know that I'd recommend using kmregion, since it's never been used for anything serious, but it should at least serve as a proof of concept.

______

http://canonical.org/~kragen/sw/dev3/kmregion.h
http://canonical.org/~kragen/sw/dev3/kmregion.c
http://canonical.org/~kragen/sw/dev3/kmregion_example.c

charleslmunger · 2h ago
I simulated yours vs the UPB arena fast path:

https://godbolt.org/z/1oTrv1Y58

Messing with it a bit, it seems like yours has a slightly shorter dependency chain because it loads the two members separately, whereas UPB loads them as a pair (it needs both in order to determine how much space is available). Yours also seems to have less register pressure; I think that's because yours bumps down, while UPB supports in-place forward extension and so needs to bump up.

If you added branch hints to signal to the compiler that your slow path is not often hit, you might see some improvement (although if you have PGO it should already do this). These paths could also be good candidates for the `preserve_most` calling convention.
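
For what it's worth, here's a sketch of what those hints look like in C (the arena layout and function names are invented for illustration; `__builtin_expect` is GCC/Clang, `preserve_most` is a Clang attribute):

    #include <stddef.h>

    struct arena { char *ptr, *limit; };

    // Out-of-line refill path. preserve_most keeps the call from
    // clobbering most caller-saved registers, so the inlined hot
    // path spills less around it.
    __attribute__((preserve_most, noinline))
    void *arena_alloc_slow(struct arena *a, size_t n);

    static inline void *arena_alloc(struct arena *a, size_t n) {
        // __builtin_expect marks the refill branch as unlikely.
        if (__builtin_expect((size_t)(a->ptr - a->limit) < n, 0))
            return arena_alloc_slow(a, n);
        return a->ptr -= n;  // hot path: bump down, as kmregion does
    }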

However, there is an unfortunate compiler behavior here for both implementations: the compiler doesn't track whether the slow path (which is not inlined, and clobbers the pointers) was actually taken, so it reloads the pointers on the hot path in both approaches. This means a sequence of allocations will store and reload the arena pointers repeatedly, when ideally the hot path would keep the current position in a register and refill it only after the cold path clobbers it.
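
Continuing the sketch above, one way to work around it by hand is to carry the bump pointer in a local across a run of allocations, writing it back only around the cold call (names as before, invented for illustration):

    // Allocate n 16-byte objects. The compiler can keep p in a
    // register across iterations instead of reloading it from the
    // arena after every call that might have clobbered it.
    void alloc_many(struct arena *a, void **out, int n) {
        char *p = a->ptr;
        for (int i = 0; i < n; i++) {
            if (__builtin_expect((size_t)(p - a->limit) < 16, 0)) {
                a->ptr = p;                 // publish before the cold call
                out[i] = arena_alloc_slow(a, 16);
                p = a->ptr;                 // re-cache the clobbered pointer
            } else {
                out[i] = (p -= 16);         // hot path: register-only bump
            }
        }
        a->ptr = p;                         // publish the final position
    }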

kragen · 2h ago
Thank you very much! I vaguely remember that it did that, and the failure to keep the pointer in registers might explain why PyPy's version is twice as fast (?).
runlaszlorun · 4h ago
I don't know much about language internals or allocation, but I'm learning. Why could this be significantly faster than a bump/arena allocator?

And is the speedup over malloc/free due to allocating one large block up front, as opposed to an individual malloc per object?

kragen · 4h ago
It is a bump allocator. I don't know why it's so much faster than mine, but my hypothesis was that CF Bolz was testing on a faster machine. The speedup over malloc/free is because bumping a pointer is much faster than calling a subroutine.
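
To make that concrete, here's a minimal sketch of the idea (not kmregion itself; refill and error handling omitted): the arena grabs one big block from malloc up front, and each per-object allocation is then just a compare and a subtract.

    #include <stdlib.h>

    struct arena { char *ptr, *limit; };

    void arena_init(struct arena *a, size_t size) {
        char *block = malloc(size);  // one big block, one malloc call
        a->limit = block;
        a->ptr = block + size;       // bump downward from the top
    }

    void *arena_alloc(struct arena *a, size_t n) {
        if ((size_t)(a->ptr - a->limit) < n)
            return NULL;             // real code would refill here
        return a->ptr -= n;          // compare + subtract, that's all
    }

Freeing the whole batch is just discarding the arena, which is the other half of the speedup.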
pizlonator · 5h ago
The reason their allocator is faster than Boehm's isn't conservative stack scanning.

You can move objects while using conservative stack scanning. This is a common approach. JavaScriptCore used to use it.

You can have super fast allocation in a non-moving collector, but that involves an algorithm that is vastly different from the Boehm one. I think the fastest non-moving collectors have similar allocation fast paths to the fastest moving collectors. JavaScriptCore has a fast non-moving allocator called bump'n'pop. In Fil-C, I use a different approach that I call SIMD turbosweep. There's also the Immix approach. And there are many others.
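
For a flavor of how a non-moving collector can still allocate in a few instructions, here's the generic shape of a segregated free-list pop, just to illustrate the idea (emphatically not JSC's actual bump'n'pop or Fil-C's turbosweep):

    // Each size class threads a singly linked free list through the
    // free cells themselves; the sweeper rebuilds these lists. The
    // fast path is then a load, a test, and a store.
    struct cell { struct cell *next; };
    struct size_class { struct cell *free; };

    static inline void *alloc_fixed(struct size_class *sc) {
        struct cell *c = sc->free;
        if (c == NULL)
            return NULL;        // slow path: sweep or get a new page
        sc->free = c->next;     // pop; the cell's memory is the object
        return c;
    }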

forrestthewoods · 3h ago
> Every allocation takes 110116790943 / 10000000000 ≈ 11 instructions and 21074240395 / 10000000000 ≈ 2.1 cycles

I don’t believe this even in the slightest. That is not a meaningful metric for literally any actual workload in the universe. It defies common sense.

A few years ago I ran some benchmarks on an old but vaguely reasonable workload. I came up with a p95 of just 25 nanoseconds but a p99.9 on the order of tens of microseconds. https://www.forrestthewoods.com/blog/benchmarking-malloc-wit...

Of course “2% of time in GC” is doing a lot of heavy lifting here. But I’d really need to see a real workload before I start to believe it.

kragen · 3m ago
You were measuring malloc, so of course you came up with numbers that were 20 times worse than PyPy's nursery allocator. That's because malloc is 20 times slower, whatever common sense says.

Also, you're talking about tail latency, while CF Bolz was measuring throughput. Contrary to your assertion, throughput is indeed a meaningful metric, though for interactive UIs such as videogames, tail latency is often more important. For applications like compilers and SMT solvers, on the other hand, throughput matters more.