Not sure if that’s relevant, but when I do micro-benchmarks like that, measuring time intervals way smaller than 1 second, I use the __rdtsc() compiler intrinsic instead of standard library functions.
On all modern processors, that instruction reads a wallclock counter that increments at the base frequency of the CPU, unaffected by dynamic frequency scaling.
Apart from the great resolution, that method has the upside of being very cheap: a couple of orders of magnitude faster than an OS kernel call.
sa46 · 2h ago
Isn't gettimeofday implemented with vDSO to avoid kernel context switching (and therefore, most of the overhead)?
My understanding is that using tsc directly is tricky. The rate might not be constant, and the rate differs across cores. [1]
I think most current systems have an invariant TSC. I skimmed your article and was surprised to see an offset (though not totally shocked), but the rate looked the same.
You could cpu pin the thread that's reading the tsc, except you can't pin threads in OpenBSD :p
wahern · 54m ago
But just to be clear (for others), you don't need to do that because using RDTSC/RDTSCP is exactly how gettimeofday and clock_gettime work these days, even on OpenBSD. Where using the TSC is practical and reliable, the optimization is already there.
OpenBSD actually only implemented this optimization relatively recently. Though most TSCs will be invariant, they still need to be trained across cores, and there are other minutiae (sleeping states?) that made it a PITA to implement in a reliable way, and OpenBSD doesn't have as much manpower as Linux. Some of those non-obvious issues would be relevant to someone trying to do this manually, unless they could rely on their specific hardware behavior.
Dylan16807 · 1h ago
If you have something newer than a Pentium 4, the rate will be constant.
I'm not sure of the details for when cores end up with different numbers.
quotemstr · 1h ago
Wizardly workarounds for broken APIs persist long after those APIs are fixed. People still avoid things like flock(2) because at one time NFS didn't handle file locking well. CLOCK_MONOTONIC_RAW is fine these days with the vDSO.
sugarpimpdorsey · 8m ago
OpenBSD is many things, but 'fast' is not a word that comes to mind.
Lightweight? Yes.
Minimalist? Definitely.
Compact? Sure.
But fast? No.
Would I host a database or fileserver on OpenBSD? Hell no.
Boot times seem to take as long as they did 20 years ago. They are also advocates for every schizo security mitigation they can dream up that sacrifices speed and that's ok too.
But let's not pretend it's something it's not.
sweetjuly · 2h ago
A better title: a pathological test program meant for Linux does not trigger pathological behavior on OpenBSD
apgwoz · 2h ago
Surely you must be new to tedu posts…
ameliaquining · 2h ago
Still worth avoiding having the HN thread be about whether OpenBSD is in general faster than Linux. This is a thing I've seen a bunch of times recently, where someone gives an attention-grabbing headline to a post that's actually about a narrower and more interesting technical topic, but then in the comments everyone ignores the content and argues about the headline.
st_goliath · 23m ago
> This is a thing I've seen a bunch of times recently ...
> ... in the comments everyone ignores the content and argues about the headline.
Surely you must be new to Hacker News…
rurban · 4h ago
No, generally Linux is at least 3x faster than OpenBSD, because OpenBSD doesn't care much for optimizations.
farhaven · 2h ago
OpenBSD is a lot faster in some specialized areas though. Random number generation from `/dev/urandom`, for example. When I was at university (in 2010 or so), it was faster to read `/dev/urandom` on my OpenBSD laptop and pipe it over ethernet to a friend's Linux laptop than running `cat /dev/urandom > /dev/sda` directly on his.
Not by just a bit, but it was a difference between 10MB/s and 100MB/s.
sillystuff · 2h ago
I think you meant to say /dev/random, not /dev/urandom.
/dev/random on Linux used to stall waiting for entropy from sources of randomness like network jitter, mouse movement, and keyboard typing. /dev/urandom has always been fast on Linux.
Today, Linux /dev/random mainly uses an RNG after initial seeding. The BSDs always did this. On my laptop, I get over 500MB/s (kernel 6.12).
IIRC, on modern Linux kernels, /dev/urandom is now just an alias to /dev/random for backward compatibility.
tptacek · 1h ago
There's no reason for normal userland code not part of the distribution itself ever to use /dev/random, and getrandom(2) with GRND_RANDOM unset is probably the right answer for everything.
Both Linux and BSD use a CSPRNG to satisfy /dev/{urandom,random} and getrandom, and, for future-secrecy/compromise-protection continually update their entropy pools with hashed high-entropy events (there's ~essentially no practical cryptographic reason a "seeded" CSPRNG ever needs to be rekeyed, but there are practical systems security reasons to do it).
sgarland · 43m ago
OpenBSD switched their PRNG to arc4random in 2012 (and then ChaCha20 in 2014); depending on how accurate your time estimate is, that could well have been the cause. Linux switched to ChaCha20 in 2016.
Related, I stumbled down a rabbit hole of PRNGs last year when I discovered [0] that my Mac was way faster at generating UUIDs than my Linux server, even taking architecture and clock speed into account. Turns out glibc didn’t get arc4random until 2.36, and the version of Debian I had at the time didn’t have 2.36. In contrast, since MacOS is BSD-based, it’s had it for quite some time.
/dev/urandom isn't a great test, IMO, simply because there are reasonable tradeoffs in security v speed.
For all I know BSD could be doing 31*last or something similar.
The algorithm is also free to change.
chowells · 2h ago
Um... This conversation is about OpenBSD, making that objection incredibly funny. OpenBSD has a mostly-deserved reputation for doing the correct security thing first, in all cases.
But that's also why the rng stuff was so much faster. There was a long period of time where the Linux dev in charge of randomness believed a lot of voodoo instead of actual security practices, and chose nonsense slow systems instead of well-researched fast ones. Linux has finally moved into the modern era, but there was a long period where the randomness features were far inferior to systems built by people with a security background.
tptacek · 1h ago
OpenBSD isn't meaningfully more secure than Linux. It probably was 20 years ago. Today it's more accurate to say that Linux and OpenBSD have pursued different security strategies --- there are meaningful differences, but they aren't on a simple one-dimensional spectrum of "good" to "bad".
(I was involved, somewhat peripherally, in OpenBSD security during the era of the big OpenBSD Security Audit).
somat · 1h ago
At one point probably 10 years ago I had linux vm guests refuse to generate gpg keys, gpg insisted it needed the stupid blocking random device, and because the vm guest was not getting any "entropy" the process went nowhere. As an openbsd user naturally I was disgusted, there are many sane solutions to this problem, but I used none of them. Instead I found rngd a service to accept "entropy" from a network source and blasted it with the /dev/random from a fresh obsd guest on the same vm host. Mainly out of spite. "look here you little shit, this is how you generate random numbers"
Interesting. I tried to follow the discussion in the linked thread, and the only takeaway I got was "something to do with RCU". What is the simplified explanation?
bobby_big_balls · 1h ago
In Linux, the file descriptor table (fdtable) of a process starts with a minimum of 256 slots. Two threads creating 256 sockets each, which uses 512 fds on top of the three already present (for stdin, stdout and stderr), requires that the fdtable be expanded about halfway through when the capacity is doubled from 256 to 512, and again near the end when resizing from 512 to 1024.
This is done by expand_fdtable() in the kernel. It contains the following code:
if (atomic_read(&files->count) > 1)
synchronize_rcu();
The field files->count is a reference counter. As there are two threads, which share a set of open files between them, the value of this is 2, meaning that synchronize_rcu() is called here during fdtable expansion. This waits until a full RCU grace period has elapsed, causing a delay in acquiring a new fd for the socket currently being created.
If the fdtable is expanded prior to creating a new thread, as the test program optionally will do by calling dup2(0, 666) if supplied a command line argument, the synchronize_rcu() call is avoided because at that point files->count == 1. Therefore, if this is done, there will be no delay later on when creating all the sockets, as the fdtable will already have sufficient capacity.
By contrast, the OpenBSD kernel doesn't have anything like RCU and just uses a rwlock when the file descriptor table of the process is being modified, avoiding the long delay during expansion that may be observed in Linux.
tptacek · 1h ago
RCUs are super interesting; here's (I think I've got the right link) a good talk on how they work and why they work that way:
Thanks for the explanation. I confirmed the performance difference by enabling the dup2 call.
I guess my question is why would synchronize_rcu take many milliseconds (20+) to run. I would expect that to be in the very low milliseconds or less.
altairprime · 51m ago
> allocating kernel objects from proper virtual memory makes this easier. Linux currently just allocates kernel objects straight out of the linear mapping of all physical memory
I found this to be a key takeaway of reading the full thread: this is, in part, a benchmark of kernel memory allocation approaches, that surfaces an unforeseen difference in FD performance at a mere 256 x 2 allocs. Presumably we’re seeing a test case distilled down from a real world scenario where this slowdown was traced for some reason?
saagarjha · 40m ago
That’s how they’re designed; they are intended to complete at some point that’s not soon. There’s an “expedited RCU” which to my understanding tries to get everyone past the barrier as fast as possible by yelling at them but I don’t know if that would be appropriate here.
viraptor · 2h ago
When 2 threads are allocating sockets sequentially, they fight for the locks. If you preallocate a bigger table by creating fd 666 first, the lock contention goes away.
JdeBP · 1h ago
It's something that has always been interesting about Windows NT, which has a multi-level object handle table, and does not have the rule about re-using the lowest numbered available table index. There's scope for reducing contention amongst threads in such an architecture.
Although:
1. back in application-mode code, the language runtime libraries make things look like a POSIX API and maintain their own table mapping object handles to POSIX-like file descriptors, where there is the old contention over the lowest free entries; and
2. in practice the object handle table seems to mostly append, so multiple object-opening threads all contend over the end of the table.
saagarjha · 55m ago
RCU is very explicitly a lockless synchronization strategy.
themafia · 2h ago
Yea, well, I had to modify your website to make it readable. Why do people do this?
GTP · 1h ago
By leaving my finger on the screen, I accidentally triggered an easter egg of two "cannons" shooting squares. Did anyone else notice it?
pan69 · 1h ago
Happened for me on my normal desktop browser, cute but distracting. It also made my mouse cursor disappear. I had to move my mouse outside the browser window to make it visible again.
evanjrowley · 1h ago
I also saw it, and it happened on a non-touch computer screen.
haunter · 3h ago
In my mind, faster = the same game with the same graphics settings has more FPS
(I don’t even know whether you can actually start mainstream games on BSD or not)
nine_k · 3h ago
Isn't it mostly limited by GPU hardware, and by binary blobs that are largely independent from the host platform?
haunter · 2h ago
Games run better under Linux (even when they're not native but run via Proton/Wine) than on Windows 11, so the platform does matter.
It annoys me when people claim this. It depends on the game, the distro, the Proton version, the desktop environment, plus a lot of other things I have forgotten about.
Also latency is frequently worse on Linux. I play a lot of quick twitch games on Linux and Windows and while fps and frame times are generally in the same ballpark, latency is far higher.
Another problem is that Proton compatibility is all over the place. Some of the games Valve said were certified don't actually work well, mods can be problematic, and generally you end up faffing with custom launch options to get things working well.
zelphirkalt · 2h ago
Many of those games mysteriously fail to work for me, almost like Proton has a problem on my system in general and I am unable to figure it out. However, in the past I got games that are made for Windows to work better on WINE than on Windows. One of those games is Starcraft 2 when it came out. On Windows it would always freeze in one movie/sequence of the single player campaign, which made it actually unplayable on Windows, while after some trial and error, I managed to get a fully working game on GNU/Linux, and was able to finish the campaign.
This goes to show, that the experience with Proton and different hardware and whatever it is in system configuration is highly individual, but also, that games can indeed run better using WINE or Proton than on the system they were made for.
extraisland · 1h ago
Consistency is better than any theoretical FPS improvements IMO.
Often for games that don't work with modern Windows there are fan patches/mods that fix these issues.
Modern games frequently have weird framerate issues that rarely happen on Windows. When I am playing a multiplayer, fast-twitch game I don't want the framerate to randomly dip.
I was gaming exclusively on Linux from 2019 and gave up earlier this year. I wanted to play Red Alert 2, and trying to work out what to do with Wine and all the other stuff was a PITA. It was all easy on Windows.
agambrahma · 2h ago
So ... essentially testing file descriptor allocation overhead
asveikau · 2h ago
My guess is it has something to do with the file descriptor table having a lot of empty entries (the dup2(0, 666) line.)
Now time to read the actual linked discussion.
wahern · 48m ago
I think dup2 is the hint, but in the example case the dup2 path isn't invoked--it's conditioned on passing an argument, but the test runs are just `./a.out`. IIUC, the issue is growing the file descriptor table. The dup2 is a workaround that preallocates a larger table (666 > 256 * 2)[1], to avoid the pathological case when a multi-threaded process grows the table. From the linked infosec.exchange discussion it seems the RCU-based approach Linux is using can result in some significant latency, resulting in much worse performance in simple cases like this compared to a simple mutex[2].
[1] Off-by-one. To be more precise, the state established by the dup2 is (667 > 256 * 2), or rather (667 > 3 + 256 * 2).
[2] Presumably what OpenBSD is using. I'd be surprised if they've already imported and adopted FreeBSD's approach mentioned in the linked discussion, notwithstanding that OpenBSD has been on an MP scalability tear the past few years.
The BSD people seem to enjoy measuring and logging a lot.
jedberg · 3h ago
"It depends"
Faster is all relative. What are you doing? Is it networking? Then BSD is probably faster than Linux. Is it something Linux is optimized for? Then probably Linux.
A general benchmark? Who knows, but does it really matter?
At the end of the day, you should benchmark your own workload, but also it's important to realize that in this day and age, it's almost never the OS that is the bottleneck. It's almost always a remote network call.
M_r_R_o_b_o_t_ · 1h ago
Ye
znpy · 2h ago
the first step in benchmarking software is to use the same hardware.
the author failed the first step.
everything that follows is then garbage.
saagarjha · 52m ago
You do understand that people who know how to benchmark things don’t actually need to conform to the rules of thumb that are given to non-experts so they don’t shoot themselves in the foot, right? Do you also write off rally drivers because they have their feet on both pedals?
the_plus_one · 3h ago
Is it just me, or is there some kind of asteroid game shooting bullets at my cursor while I try to read this [1]? I hate to sound mean, but it's a bit distracting. I guess it's my fault for having JavaScript enabled.
It's extremely distracting. I'm not normally one to have issues that require reduced motion, but the asteroids are almost distracting enough on their own, and the fact that it causes my cursor to vanish is a real accessibility issue. I didn't actually realize just how much I use my mouse cursor when reading stuff until now, partly as a fidget, partly as a controllable visual anchor as my eyes scan the page.
joemi · 40m ago
I actually can't read things on that site at all. I move my mouse around while reading, not necessarily near the words I'm currently reading, so when my mouse disappears it's haltingly distracting. In addition to that, the way the "game" visually interferes with the text that I'm trying to read makes it incredibly hard to focus on reading. These two things combine to make this site literally unreadable for me.
I don't get why people keep posting and upvoting articles from this user-hostile site.
binarycrusader · 22m ago
I found it exceedingly difficult to read, so I ended up applying these ublock filter rules so I could read it:
> Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
Because way more people have opinions about e.g. asteroid game scripts on web pages than have opinions on RCUs, these subthreads spread like kudzu.
ummonk · 53m ago
The described behavior sounds significantly worse than a tangential annoyance, and isn’t really a common occurrence even on modern user-hostile websites.
Jtsummers · 7m ago
He used to have a loading screen that did nothing if you have JS enabled in your browser, but no loading screen (which, again, did nothing) if you had JS disabled. I'm pretty sure it's meant to deliberately annoy, though this one is less annoying than the loading screen was.
nomel · 2h ago
And, if it hits, your cursor disappears! I wish there was some explosion.
bigstrat2003 · 2h ago
No, it's the website's fault for doing stupid cutesy stuff that makes the page harder to read. Don't victim-blame yourself here.
stavros · 1h ago
I really don't understand this "everything must be 100% serious all the time" attitude. Why is it stupid?
apodik · 1h ago
I generally think stuff like that makes the web much more interesting.
In this case it was distracting though.
Gualdrapo · 1h ago
The HN hivemind decries the lack of humanity and personality of the internet of nowadays but at the same time wants every website to be 100% text, no JS, no CSS because allegedly nobody needs CSS and, if you dare to do something remotely "fancy" with the layout, you have to build it with <table>s.
[1]: https://www.pingcap.com/blog/how-we-trace-a-kv-database-with...
[0]: https://gist.github.com/stephanGarland/f6b7a13585c0caf9eb64b...
https://www.youtube.com/watch?v=9rNVyyPjoC4
https://news.ycombinator.com/item?id=44381144
[1]: https://flak.tedunangst.com/script.js
https://news.ycombinator.com/newsguidelines.html