LLM Inevitabilism (tomrenner.com)

You always want the load average to be less than the total number of CPU cores. If higher, you're likely experiencing a lot of waits and context switching.

[1] https://www.brendangregg.com/blog/2017-08-08/linux-load-aver...

tanelpoder · 16h ago

On Linux this is not true, on an IO heavy system - with lots of synchronous I/Os done concurrently by many threads - your load average may be well over the number of CPUs, without having a CPU shortage. Say, you have 16 CPUs, load avg is 20, but only 10 threads out of 20 are in Runnable (R) mode on average, and the other 10 are in Uninterruptible sleep (D) mode. You don't have a CPU shortage in this case.

Note that synchronous I/O completion checks for previously submitted asynchronous I/Os (both with libaio and io_uring) do not contribute to system load as they sleep in the interruptible sleep (S) mode.

That's why I tend to break down the system load (demand) by the sleep type, system call and wchan/kernel stack location when possible. I've written about the techniques and one extreme scenario ("system load in thousands, little CPU usage") here:

https://tanelpoder.com/posts/high-system-load-low-cpu-utiliz...

fuy · 2h ago

Hey Tanel - I wanted to thank you for that blog post and psn tool - it recently helped me in a tricky performance investigation.

lotharcable · 16h ago

The proper way is to have a idea of what it normally is before you need to troubleshoot issues.

What is a 'good load' depends on the application and how it works. Some servers something close to 0 is a good thing. Other servers a 10 or lower means something is seriously wrong.

Of course if you don't know what is a 'good' number or you are trying to optimize a application and looking for bottlenecks then it is time to reach for different tools.

chasil · 21h ago

Glances is nice. I think it is a clone of HP-UX Glance.

https://nicolargo.github.io/glances/

I have also hacked basic top to add database login details to server processes.

Propelloni · 1d ago

Me too! So much so that I add it to my .bashrc everywhere.

__turbobrew__ · 1d ago

If you like this post, I would recommend “BPF Performance Tools” and “Systems Performance: Enterprise and the Cloud” by Brenden Gregg.

I have pulled out a few miracles using these tools (identifying kernel bottlenecks or profiling programs using ebpf) and it has been well worth the investment to read through the books.

yankcrime · 1d ago

Agreed, highly recommended reading. A slightly more up-to-date post of his which recommends tools in such situations is: https://www.brendangregg.com/blog/2024-03-24/linux-crisis-to...

wcunning · 1d ago

Literally did miracles at my last job with the first book and that got me my current job, where I also did some impressive proving which libraries had what performance with it again... Seriously valuable stuff.

__turbobrew__ · 20h ago

Yea it is kindof cheating. I was helping someone debug why their workload was soft locking. I ran the profiling tools and found that cgroup accounting for the workload was taking nearly all the cpu time on locks. From searches through linux git logs I found that cgroup accounting in older kernels had global locks. I saw that newer kernels didn’t have this, so we moved to a newer kernels and all the issues went away.

People thought I was a wizard lol.

net01 · 3h ago

I am curious, why don't you update regularly? (student here)

ch33zer · 22h ago

Almost all of these have been replaced for me with below: https://developers.facebook.com/blog/post/2021/09/21/below-t...

It is excellent and contains most things you could need. Downside is that it isn't yet a standard tool so you need to get it installed across your fleet

benreesman · 18h ago

Oh man nostalgia city. I vividly remember meeting atop time travel debugging at 3am in Menlo Park in 2012, wild times.

tomhow · 19h ago

Previously:

Linux Performance Analysis in 60,000 Milliseconds - https://news.ycombinator.com/item?id=10652076 - Nov 2015 (11 comments)

Linux Performance Analysis - https://news.ycombinator.com/item?id=10654681 - Dec 2015 (82 comments)

Linux Performance Analysis in 60k Milliseconds (2015) [pdf] - https://news.ycombinator.com/item?id=44070741 - May 2025 (1 comment)

mortar · 1d ago

2015

Previous discussions: https://news.ycombinator.com/item?id=10654681 https://news.ycombinator.com/item?id=10652076

danieldk · 1d ago

Yeah, I skipped the date and then saw Linux 3.13 in the examples.

5pl1n73r · 20h ago

After this article was written, `free -m` on many systems started to have an "available" column that shows the sum of reclaimable and free memory. It's nicer than the "-/+" section shown in this old article.

  $ free -m
                 total        used        free      shared  buff/cache   available
  Mem:            3915        2116        1288          41         769        1799
  Swap:            974           0         974

fduran · 23h ago

shameless plug: you can practice this in a free VM https://docs.sadservers.com/docs/scenario-guides/practical-l... (there's a typo there to keep you on your feet)

CodeCompost · 1d ago

> At Netflix we have a massive EC2 Linux cloud

Wait a minute. I thought Netflix famously ran FreeBSD.

craftkiller · 1d ago

My understanding was their CDN ran on FreeBSD, but not their API servers. But I don't work for Netflix.

diab0lic · 1d ago

Your understanding is correct.

achierius · 22h ago

Why did they not choose to use it for both (or neither)? I.e., what reasons for using FreeBSD on CDN servers would not also apply to using them for API servers?

seabrookmx · 22h ago

They are extremely different workloads so.. everything?

The CDN servers are basically appliances, and are often embedded in various data centers (includes those ran by ISP's) to aggressively cache content. They care about high throughput and run a single workload. Being able to fine tune the entire stack, right down to the TCP/IP implementation is very valuable in this case. Since they ship the hardware and software, they can tightly integrate the two.

By contrast, API workloads are very heterogeneous. I'd have to imagine the ability to run any standard Linux software there would also be a big plus. Linux clearly has much more vetting on cloud providers than FreeBSD as well.

aflag · 22h ago

Can't you fine tune linux as well? Does FreeBSD perform better somehow on a CDN workload? I find it difficult to imagine that the reason is performance. But I don't know what the reason is.

craftkiller · 21h ago

Netflix discusses their reasons starting at 18:20: https://www.youtube.com/watch?v=veQwkG0WdN8&t=18m20s

tl;dw: the performance, the efficiency of development, the community, FreeBSD is a complete operating system, the code base is smaller, the ports system, and the license.

and this video covers the optimizations Netflix has made to FreeBSD: https://www.youtube.com/watch?v=36qZYL5RlgY

Also potentially a reason: According to drewg123, Linux's kTLS was broken. Which I see drewg123 also commenting in this thread. Is he the "Drew on my team" mentioned in the first video? Is he the speaker in the 2nd video? Idk https://news.ycombinator.com/item?id=28585008

drewg123 · 1d ago

The CDN runs FreeBSD. Linux is used for nearly everything else.

louwrentius · 1d ago

The iostat command has always been important to observe HDD/SDD latency numbers.

Especially SSDs are treated like magic storage devices with infinite IOPS at Planck-scale latency.

Until you discover that SSDs that can do 10GB/s don't do nearly so well (not even close) when you access them in a single thread with random IOPS, with queue depth of 1.

wcunning · 1d ago

That's where you start down the eBPF rabbit hole with bcc/biolatency and other block device histogram tools. Further, the cache hit rate and block size behavior of the SSD/NVME drive can really affect things if, say, your autonomous vehicle logging service uses MCAP with a chunk size much smaller than a drive block... Ask me how I know

rkachowski · 1d ago

it's 10 years later - what's the 60 second equivalent in 2025?

wcunning · 1d ago

@yankcrime posted it above: https://www.brendangregg.com/blog/2024-03-24/linux-crisis-to...

BlackLotus89 · 23h ago

PSI (pressure stall information) are missing.

I always use a configured!(F2) htop (not mentioned as well). Always enable PSI information in htop (some red hat systems I work with still don't offer them...).

If you have zfs enable those meters as well and htop has an io tab, use it!

whalesalad · 1d ago

I quite like `iotop` as an alternative to iostat. https://linux.die.net/man/1/iotop

emmelaich · 1d ago

Nice list. sar/sysstat is underrated imho.

mmh0000 · 23h ago

Oh man. There's a blast from the past.

Today, you'd want something like:

Prometheus + Node Exporter [1]

[1] https://github.com/prometheus/node_exporter

ImPostingOnHN · 1d ago

Maybe I missed it, but checking available disk space is often a good step in diagnosing misbehaving systems.

babuloseo · 1d ago

he forgot about rusttop

AnyTimeTraveler · 18h ago

I'm pretty sure that that didn't exist in 2015 ;)

LLM Inevitabilism (tomrenner.com)

Do not download the app, use the website (idiallo.com)

CARA – High precision robot dog using rope (aaedmusa.com)

Kiro: A new agentic IDE (kiro.dev)

Study mode (openai.com)

Show HN: Tinder but it's only pictures of my wife and I can only swipe right (trytender.app)

Linux Reaches 5% Desktop Market Share in USA (ostechnix.com)

EU age verification app to ban any Android system not licensed by Google (reddit.com)

Copyparty – Turn almost any device into a file server (github.com)

Dumb Pipe (dumbpipe.dev)

‘I witnessed war crimes’ in Gaza – former worker at GHF aid site [video] (bbc.com)

Enough AI copilots, we need AI HUDs (geoffreylitt.com)

Valve confirms credit card companies pressured it to delist certain adult games (pcgamer.com)

Performance and telemetry analysis of Trae IDE, ByteDance's VSCode fork (github.com)

Hyatt Hotels are using algorithmic Rest “smoking detectors” (twitter.com)

Show HN: Use Their ID – Use your local UK MP’s ID for the Online Safety Act (use-their-id.com)

M8.7 earthquake in Western Pacific, tsunami warning issued (earthquake.usgs.gov)

How to Firefox (kau.sh)

Global hack on Microsoft Sharepoint hits U.S., state agencies, researchers say (washingtonpost.com)

Reflections on OpenAI (calv.info)

Qwen3-Coder: Agentic coding in the world (qwenlm.github.io)

AI overviews cause massive drop in search clicks (arstechnica.com)

Graphene OS: a security-enhanced Android build (lwn.net)

It's time for modern CSS to kill the SPA (jonoalderson.com)

Ukrainian hackers destroyed the IT infrastructure of Russian drone manufacturer (prm.ua)

ChatGPT agent: bridging research and action (openai.com)

VPN use surges in UK as new online safety rules kick in (ft.com)

Windsurf employee #2: I was given a payout of only 1% what my shares where worth (twitter.com)

Mistral Releases Deep Research, Voice, Projects in Le Chat (mistral.ai)

Cops say criminals use a Google Pixel with GrapheneOS – I say that's freedom (androidauthority.com)

Tom Lehrer has died (nytimes.com)

The United States withdraws from UNESCO (state.gov)

Steam, Itch.io are pulling ‘porn’ games. Critics say it's a slippery slope (wired.com)

XMLUI (blog.jonudell.net)

TrackWeight: Turn your MacBook's trackpad into a digital weighing scale (github.com)

My Self-Hosting Setup (codecaptured.com)

Show HN: Shoggoth Mini – A soft tentacle robot powered by GPT-4o and RL (matthieulc.com)

Ozzy Osbourne has died (bbc.co.uk)

Coding with LLMs in the summer of 2025 – an update (antirez.com)

Rust running on every GPU (rust-gpu.github.io)

Visa and Mastercard are getting overwhelmed by gamer fury over censorship (polygon.com)

Claude Code weekly rate limits

Oakland cops gave ICE license plate data; SFPD also illegally shared with feds (sfstandard.com)

Cloudflare 1.1.1.1 Incident on July 14, 2025 (blog.cloudflare.com)

Death by AI (davebarry.substack.com)

You can now disable all AI features in Zed (zed.dev)

New colors without shooting lasers into your eyes (dynomight.net)

Complete silence is always hallucinated as "ترجمة نانسي قنقر" in Arabic (github.com)

Women dating safety app 'Tea' breached, users' IDs posted to 4chan (404media.co)

My 2.5 year old laptop can write Space Invaders in JavaScript now (GLM-4.5 Air) (simonwillison.net)

Linux Performance Analysis (2015)

Comments (38)