Show HN: eBPF memory profiling at OOM kill time

gnurizen · 8/13/2025, 2:52:33 PM · polarsignals.com
Check out our new project to add some insight into OOM kill events for Go programs.

Source code: https://github.com/parca-dev/oomprof/

Description of how it was put together: https://www.polarsignals.com/blog/posts/2025/08/13/oomprof

Comments (1)

Bender · 1h ago
On a related note, and FWIW, OOM kills can be reduced or entirely mitigated by a combination of kernel settings and confining applications to cgroups. Some of the general, baseline settings are listed below (a quick way to check what a box is currently running with is sketched after the list):

- vm.overcommit_ratio should be set to 0 on non-development machines. Some applications are greedy and do not play well when memory is constrained, so that requires working with the application developers to improve their memory management.

- vm.min_free_kbytes should be set based on a formula. Red Hat had a decent one, but DBAs don't like when it is used because they want every last bit of RAM, so that battle is left to the sysadmins. It becomes a circular argument and leads to ever-increasing memory purchases.

- vm.admin_reserve_kbytes and vm.user_reserve_kbytes are also set based on a formula, but each company will have to come up with its own based on how fast memory is allocated by its in-house applications.

- vm.swappiness and vm.vfs_cache_pressure will also vary by server role and intended use, but higher cache pressure can reduce the risk of a race condition that leads to OOM on systems with low free memory.

- vm.compaction_proactiveness can cause lag and race conditions on systems under high memory pressure. This can be exacerbated by the /sys/kernel/mm/transparent_hugepage/enabled and /sys/kernel/mm/transparent_hugepage/defrag settings, among other related knobs, but that's a long story with a lot of variables, especially on systems with TBs of RAM and long uptimes.
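To see what a given box is actually running with before changing anything, here is a minimal read-only sketch in Go that just dumps the tunables above plus the cgroup v2 memory limits. The cgroup path is an assumption (the common container case); systems still on cgroup v1 expose the limits elsewhere.

    // Minimal sketch: print the vm.* tunables discussed above plus the
    // cgroup v2 memory limits, so they can be recorded next to an OOM report.
    package main

    import (
        "fmt"
        "os"
        "strings"
    )

    // readTrimmed returns the file's contents with surrounding whitespace
    // removed, or an error string if the file is missing on this kernel.
    func readTrimmed(path string) string {
        b, err := os.ReadFile(path)
        if err != nil {
            return "unavailable (" + err.Error() + ")"
        }
        return strings.TrimSpace(string(b))
    }

    func main() {
        // overcommit_memory is included alongside overcommit_ratio because
        // the ratio only takes effect depending on the overcommit mode.
        vmTunables := []string{
            "overcommit_memory", "overcommit_ratio", "min_free_kbytes",
            "admin_reserve_kbytes", "user_reserve_kbytes",
            "swappiness", "vfs_cache_pressure", "compaction_proactiveness",
        }
        for _, name := range vmTunables {
            fmt.Printf("vm.%s = %s\n", name, readTrimmed("/proc/sys/vm/"+name))
        }
        for _, name := range []string{"enabled", "defrag"} {
            fmt.Printf("transparent_hugepage/%s = %s\n",
                name, readTrimmed("/sys/kernel/mm/transparent_hugepage/"+name))
        }
        // Assumption: cgroup v2 with /sys/fs/cgroup mounted as this process's
        // own cgroup (the common case inside a container). On other setups,
        // resolve the cgroup path from /proc/self/cgroup first.
        fmt.Printf("memory.max  = %s\n", readTrimmed("/sys/fs/cgroup/memory.max"))
        fmt.Printf("memory.high = %s\n", readTrimmed("/sys/fs/cgroup/memory.high"))
    }

It only reads, so it is safe to run on production hosts; paste its output into the ticket next to the dmesg OOM report.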

There is no magic bullet or "correct" setting. All of these will vary by role, usage, etc. I will leave it to the reader to research each setting. This will require some load testing that puts each system under real-world high memory pressure. Oh, and if someone starts trying to solve this with swap, just slowly back away and don't make any sudden movements. Distract them and run.

I managed over 50K servers with anywhere from 96GB to 3TB of RAM, and the only OOMs were caused by operations teams that had full control over memory allocation for Java and wanted to use every last KB of physical RAM installed without factoring in memory outside of the heap. None of these servers had any swap, as it would have had to be encrypted and nobody wanted to deal with that. OOMs are almost always PEBKAC.
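To make the "memory outside of the heap" point concrete, here is a hypothetical back-of-the-envelope example with invented numbers for a 96GB box where the heap was sized to nearly all of physical RAM:

\[
\underbrace{92}_{\text{heap (-Xmx)}} + \underbrace{2}_{\text{metaspace, code cache}} + \underbrace{2}_{\text{thread stacks}} + \underbrace{2}_{\text{direct buffers}} \approx 98\ \text{GB} > 96\ \text{GB physical RAM}
\]

With no swap to spill into, the kernel's only option once those pages actually get touched is the OOM killer.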