> How does this work? The answer is surprisingly complex.
I don't think anyone is surprised in the complexity of any explanation for that algorithm :D
divbzero · 1h ago
> Note that modern compilers like gcc or clang will produce something like is_leap_year2 from is_leap_year1, so there is not much point in doing this in C source, but it might be useful in other programming languages.
The optimizations that compilers can achieve kind of amaze me.
Indeed, the latest version of cal from util-linux keeps it simple in the C source:
I love these incomprehensible magic number optimizations. Every time I see one I wonder how many optimizations like this we missed back in the old days when we were writing all our inner loops in assembly?
It is not on the list, but #define CMP(X, Y) (((X) > (Y)) - ((X) < (Y))) is an efficient way to do generic comparisons for things that want UNIX-style comparators. If you compare the output against 0 to check for some form of greater than, less than or equality, the compiler should automatically simplify it. For example, CMP(X, Y) > 0 is simplified to (X > Y) by a compiler.
The signum(x) function that is equivalent to CMP(X, 0) can be done in 3 or 4 instructions depending on your architecture without any comparison operations:
It is such a famous example, that compilers probably optimize CMP(X, 0) to that, but I have not checked. Coincidentally, the expansion of CMP(X, 0) is on the bit hacks list.
There are a few more superoptimized mathematical operations listed here:
sometimes also known as superoptimization, which many of them also use SMT solvers like Z3 mentioned in the article
tylerhou · 1h ago
Yes, sorry, superoptimization is the correct term.
masfuerte · 8h ago
We didn't miss them. In those days they weren't optimizations. Multiplications were really expensive.
godelski · 5h ago
Related, Computerphile had a video a few months ago where they try to put compute time relative to human time, similar to the way one might visualize an atom by making the proton the size of a golfball. I think it can help put some costs into perspective and really show why branching maters as well as the great engineering done to hide some of the slowdowns. But definitely some things are being marked simply by the sheer speed of the clock (like how the small size of a proton hides how empty an atom is)
https://youtube.com/watch?v=PpaQrzoDW2I
JdeBP · 1h ago
Multiplications of this word length, one should clarify. It's not that multiplication was an inherently more expensive or different operation back then (assuming from context here that the "old days" of coding inner loops in assembly language pre-date even the 32-bit ALU era). Binary multiplication has not changed in millennia. Ancient Egyptians were using the same binary integer multiplication logic 5 millennia ago as ALUs do today.
It was that generally the fast hardware multiplication operations in ALUs didn't have very many bits in the register word length, so multiplications of wider words had to be done with library functions that did long multiplication in (say) base 256.
So this code in the headlined article would not be "three instructions" but three calls to internal helper library functions used by the compiler for long-word multiplication, comparison, and bitwise AND; not markedly more optimal than three internal helper function calls for the three original modulo operations, and in fact less optimal than the bit-twiddled modulo-powers-of-2 version found halfway down the headlined article, which would only need check the least significant byte and not call library functions for two of the 32-bit modulo operations.
Bonus points to anyone who remembers the helper function names in Microsoft BASIC's runtime library straight off the top of xyr head. It is probably a good thing that I finally seem to have forgotten them. (-: They all began with "B$" as I recall.
kurthr · 8h ago
and divides were worse. (1 cycle add, 10 cycle mult, 60 cycle div)
genewitch · 7h ago
That's fair but mod is division, or no? So realistically the new magic number version would be faster. Assuming there is 32 bit int support. Sorry, this is above my paygrade.
bobmcnamara · 4h ago
Many compiles will compute div-by-a-constant using the invert, multiply, and shift off the remainder trick. Once you have that, you can do mod-by-a-constant as a derivative and usually still beat 1-bit or 2-bit division.
Yeah, I'm thinking more of ones that remove all the divs from some crazy math functions for graphics rendering and replace them all with bit shifts or boolean ops.
dndn1 · 7h ago
If you need to know a leap year and it's before the year 6000, I made an interactive calculator and visualization [1].
It's >3 machine instructions (and I admire the mathematical tricks included in the post), but it does do thousands of calculations fairly quickly :)
Taking a look at numbers in binary reveals some interesting patterns. Although seems obvious, it was interesting to me when I realized that all prime numbers except 2 end with 1.
silisili · 2h ago
Not trying to be a jerk, but why is that interesting? Am I missing something more than all odd numbers end in 1, and primes by their nature cannot be even(except 2, as you mentioned).
lkirkwood · 1h ago
Just nitpick but all odd numbers end in an odd number, not 1, and all even numbers end in an even number i.e. a multiple of 2.
silisili · 1h ago
Sure, but OP was talking about binary representation.
nullc · 16m ago
There are many cute binary/logic tricks, if you like them be sure to read Hackers Delight and https://graphics.stanford.edu/~seander/bithacks.html . Once you've studied enough of them you'll find yourself easily coming up with more.
Warning: This may increase or decrease your popularity with fellow programmers, depending on how lucky you are in encountering problems where they make an important performance difference rather than a readability problem for people who have not deeply internalized bit twiddling.
Multiply and mask for varrious purposes is a thing I commonly use in my own code-- it's much more attractive now that it was decades ago because almost all computers we target these days have extremely fast multipliers.
These full-with logic operations and multipliers give you kind of a very parallel computer packed into a single instruction. The only problem is that it's a little tricky to program. :)
At least this one was easy to explain mechanically. Some bit hacks require p-adic numbers and other elements of number theory to explain.
22c · 7h ago
Part-way through the section on bit-twiddling, I thought to myself "Oh I wonder if we could use a solver here". Lo and behold, I was pleasantly surprised to see the author then take that exact approach. Love the attention to detail in this post!
JdeBP · 3h ago
One of these days, someone will get creative and do this sort of thing for the leap year algorithm of the Revised Julian Calendar instead of the Gregorian one.
NelsonMinar · 6h ago
It's rare to read code that makes me literally laugh out loud. What a delight.
JdeBP · 1h ago
If the author ever reads this: The hyperlink to Jacob Pratt's article has the hyperlink URL and text swapped.
hairtuq · 1h ago
Thanks, fixed.
usr1106 · 3h ago
Interesting. In one place the author argues: 0 is missing, but we already know...
The is no year 0, it goes 1 BC, 1 AD. So testing whether 0 is a leap year is moot.
skissane · 3h ago
> The is no year 0, it goes 1 BC, 1 AD. So testing whether 0 is a leap year is moot.
Which is arguably the right thing to do outside of specific domains (such as history) in which BCE is entrenched
If your software really has to display years in BCE, I think the cleanest way is store it as astronomical year numbering internally, then convert to CE/BCE on output
rf15 · 2h ago
> Astronomers use the Julian calendar for years before 1582, including the year 0, and the Gregorian calendar for years after 1582
So what happens when it's 1582? (sorry, currently no time to articulate a good wiki fix)
skissane · 55m ago
I think they use the original Gregorian cutover, in which 1582-10-04 is followed by 1582-10-15, and the dates 1582-10-05 through 1582-10-14 don’t exist.
However, in general, I think proleptic Gregorian is simpler. But in astronomy do what the astronomers do. And in history, dates between 1582 and 1923 (inclusive), you really need to explicitly mark the date as Gregorian or Julian, or have contextual information (such as the country) to determine which one to use.
1923 because that was when Greece switched from Julian to Gregorian, the last country to officially do so. Although various other countries in the Middle East and Asia adopted the Gregorian calendar more recently than 1923 - e.g. Saudi Arabia switched from the Islamic calendar to the Gregorian for most commercial purposes in 2016, and for most government purposes in 2023 - those later adoptions aren’t relevant to Julian-Gregorian cutover since they weren’t moving from Julian to Gregorian, they were moving from something non-Western to Gregorian
Large chunks of the Eastern Orthodox Church still use the Julian calendar for religious purposes; other parts theoretically use a calendar called “Revised Julian” which is identical to Gregorian until 2800 and different thereafter - although I wonder if humanity (and those churches) are still around in 2800, will they actually deviate from Gregorian at that point, or will they decide not to after all, or forget that they were officially supposed to
JdeBP · 3h ago
Go back to the start of the article, and you'll find that using the proleptic Gregorian calendar with astronomical year numbering is a premise for the algorithm.
Without that design constraint, testing for leap years becomes locale-dependent and very complex indeed.
Everything before the introduction of the gregorian calendar is moot:
"In 1582, the pope suggested that the whole of Europe skip ten days to be in sync with the new calendar. Several religious European kingdoms obeyed and jumped from October 4 to October 15."
So you cannot use any date recorded before that time for calculations.
And before that it gets even more random:
"The priests’ observations of the lunar cycles were not accurate. They also deliberately avoided leap years over superstitions. Things got worse when they started receiving bribes to declare a year longer or shorter than necessary. Some years were so long that an extra month called Intercalaris or Mercedonius was added."
usr1106 · 1h ago
Before 1582 the rule is just simpler. If it is divisible by 4 it's a leap year. So the difference is relevant for years 300, 500, 600, 700, 900 etc. For ranges spanning those years the Gregorian algorithm would result in results not matching reality.
When the Julian calendar was really adopted I don't know. Certainly not 0001-01-01. And of course it varies by country like Gregorian.
timewizard · 1h ago
ISO8601 accepts year 0. It is 1 BC in astronomical calendars. All the BC years gain a -1 offset as a result.
usr1106 · 1h ago
Interesting, how standards just ignore reality.
At work we had discussions what date format to use in our product. It's for trained users only (but not IT people), English UI only, but used on several continents. Our regulatory expert propsed ISO8601. I did not agree, because that is not used anywhere in daily life except by 8 millions Swedes. I voted 15-Apr-2025 is much less prone to human error. (None of us "won". Different formats in different places still...)
olq · 7h ago
This is gold! HN Kino! Taking the hardest problem in the world, date checking, and casually bitflipping the hell out of it. Hats off man :D
The original function is likely only going to be 3 instructions. xor, test, jne and only 1 of these is dependent on a previous instruction. In the "fast" version from the article there are 4 instructions with each depending on the previous instruction. I'm not surprised it lost in the benchmark.
degamad · 22m ago
3 instructions and a branch prediction.
crossroadsguy · 5h ago
This reminds me of once when I was giving an algo/ds interview (in Java) and the interviewer started asking me questions where one of the answers usually was “to be memorised” shit like this and he started pestering me to give him those answers (even though I said I don’t know and definitely don’t recall) and that too in C. As per him “everyone who coded knew C.. at least in college” and started becoming a bit more hostile. I think it was the first interview that I had ended as an interviewee.
And I call this thing “bit gymnastics”.
abrookewood · 4h ago
Sounds like someone desperate to prove their superiority - now imagine working with them every day.
pastage · 2h ago
It is not they way todo it agreed on that, but it can be one of the only way of getting hand wavy persons to get technical. Interviewing can be frustrating when you do not see what you expected. So no need to believe it was done toxicly.
drewg123 · 7h ago
I tend to be of the opinion that for modern general purpose CPUs in this era, such micro-optimizations are totally unnecessary because modern CPUs are so fast that instructions are almost free.
But do you know what's not free? Memory accesses[1]. So when I'm optimizing things, I focus on making things more cache friendly.
The thing is about these optimisations (assuming they test as higher performance) is that they can get applied in a library and then everyone benefits from the speedup that took some hard graft to work out. Very few people bake their own date API nowadays if they can avoid it since it already exists and techniques like this just speed up every programme whether its on the critical path or not.
codexb · 7h ago
That's basically compilers these days. It used to be that you could try and optimize your code, inline things here and there, but these days, you're not going to beat the compiler optimization.
kragen · 6h ago
That is a meme that people repeat a lot, but it turns out to be wrong:
Meanwhile, GCC will happily implement bsearch() without cmov instructions and the result will be slower than a custom implementation on which it emits cmov instructions. I do not believe anyone has filed a bug report specifically about the inefficient bsearch(), but the bug report I filed a few years ago on inefficient code generation for binary search functions is still open, so I see no point in bothering:
Eliminating comparator function overhead via inlining is also a part of the improvement, which we would not have had because the OpenZFS code is not built with LTO, so even if the compiler fixes that bug, the patch will still have been useful.
godelski · 5h ago
Definitely not true.
You aren't going to beat the compiler if you have to consider a wide range of inputs and outputs but that isn't a typical setting and you can actually beat them. Even in general settings this can be true because it's still a really hard problem for the compiler to infer things you might know. That's why C++ has all those compiler hints and why people optimize with gcc flags other than -O.
It's often easy to beat Blas in matrix multiplication if you know some conditions on that matrix. Because Blas will check to find the best algo first but might not know (you could call directly of course and there you likely won't win, but you're competing against a person not the compiler).
Never over estimate the compiler. The work the PL people do is unquestionably useful but they'll also be the first to say you can beat it.
You should always do what Knuth suggested (the often misunderstood "premature optimization" quote) and get the profile.
These days optimizing compilers are your number one enemy.
They'll "optimize" your code by deleting it. They'll "prove" your null/overflow checks are useless and just delete them. Then they'll "prove" your entire function is useless or undefined and just "optimize" it to a no-op or something. Make enough things undefined and maybe they'll turn the main function into a no-op.
In languages like C, people are well advised to disable some problematic optimizations and explicitly force the compiler to assume some implementation details to make things sane.
ryao · 5h ago
If they prove a NULL check is always false, it means you have dead code.
For example:
if (p == NULL) return;
if (p == NULL) doSomething();
It is safe to delete the second one. Even if it is not deleted, it will never be executed.
What is problematic is when they remove something like memset() right before a free operation, when the memset() is needed to sanitize sensitive data like encryption keys. There are ways of forcing compilers to retain the memset(), such as using functions designed not to be optimized out, such as explicit_bzero(). You can see how we took care of this problem in OpenZFS here:
They just think you have lots of dead code because of silly undefined behavior nonsense.
char *allocate_a_string_please(int n)
{
if (n + 1 < n)
return 0; // overflow
return malloc(n + 1); // space for the NUL
}
This code seems okay at first glance, it's a simple integer overflow check that makes sense to anyone who reads it. The addition will overflow when n equals INT_MAX, it's going to wrap around and the function will return NULL. Reasonable.
Unfortunately, we cannot have nice things because of optimizing compilers and the holy C standard.
The compiler "knows" that signed integer overflow is undefined. In practice, it just assumes that integer overflow cannot ever happen and uses this "fact" to "optimize" this program. Since signed integers "cannot" overflow, it "proves" that the condition always evaluates to false. This leads it to conclude that both the condition and the consequent are dead code.
Then it just deletes the safety check and introduces potential security vulnerabilities into the software.
They had to add literal compiler builtins to let people detect overflow conditions and make the compiler actually generate the code they want it to generate.
Fighting the compiler's assumptions and axioms gets annoying at some point and people eventually discover the mercy of compiler flags such as -fwrapv and -fno-strict-aliasing. Anyone doing systems programming with strict aliasing enabled is probably doing it wrong. Can't even cast pointers without the compiler screwing things up.
oguz-ismail · 2h ago
>if (n + 1 < n)
No one does this
drewg123 · 7h ago
To temper this slightly, these sorts of optimizations are useful on embedded CPUs for device firmware, IOT, etc. I've worked on smart NIC CPUs where cycles were so precious we'd do all kinds of crazy unreadable things.
bigiain · 6h ago
I suspect most IOT device manufacturers expect/design their device to be landfill before worrying about leap year math. (In my least optimistic moments, I suspect some of them may intentionally implement known broken algorithms that make their eWaste stop working correctly at some point in the near future that's statistically likely to bear beyond the warranty period.)
crote · 5h ago
"Is year divisible by four" will work perfectly fine for the next 75 years. Random consumer devices are definitely not going to survive that long, so being capable of dealing with it adds exactly zero value to the product.
It's hard to imagine being in a situation where the cost of the additional "year divisible by 100" check, or even the "year divisible by 400" check is too much to bear, and it's trivial enough that the developer overhead is negligible, but you never know when you need those extra handful of bytes of memory I guess...
sitzkrieg · 7h ago
on the flip side of the topic, trying to do any datetime handling on the edge of embedded compute is going to be wrong 100% of the time anyway
RaoulP · 7h ago
Would you mind elaborating?
godelski · 4h ago
> modern CPUs are so fast that instructions are almost free.
Please don't.
These things compound. You especially need to consider typical computer usage involves using more than one application at a time. There's a tragedy of the commons issue that's often ignored. It can be if you're optimizing your code (you're minimizing your share!) but it can't be if you're not.
I guarantee you we'd have a lot of faster things if people invested even a little time (these also compound :). Two great examples might be Llama.cpp and FlashAttention. Both of these have had a huge impact of people (among a number of other works) but don't get nearly the same attention as other stuff. These are popular instances but I promise you that there's a million problems like these waiting to be solved. It's just not flashy, but hey plumbers and garbagemen are pretty critical jobs too
EnPissant · 4h ago
You haven't refuted the parent comment at all. They asserted that instructions are insignificant, and other things, such as memory accesses, dominate.
kreco · 7h ago
> such micro-optimizations are totally unnecessary because modern CPUs are so fast that instructions are almost free.
I'm amazed by the fact there is always someone who will say that such optimization are totally unnecessary.
godelski · 5h ago
I'm amazed by the fact there is always someone who misinterprets Knuth's "premature optimization", reading as "don't optimize" instead of "pull out the profiler"
recursive · 6h ago
Some people have significant positions on CPU manufacturers, so there will always be at least a few.
achierius · 7h ago
Just because CPU performance is increasing faster than DRAM speeds doesn't mean that CPU performance is "free" while memory is "expensive". One thing that you're ignoring is the impact of caches and prefetching logic which have significantly improved the performance of memory-bound workloads in the last 5-10 years. DRAM might be slow, but if you avoid going out to DRAM...
More broadly, it 100% depends on your workload. You'd be surprised at how many workloads are compute-bound, even today: LLM inference might be memory bound (thus how it's possible to get remotely good performance on a CPU), but training, esp. prefill, is very much not. And once you get out of LLMs, I'd say that most applications of ML tend to be compute bound.
kragen · 6h ago
It's true that this code was optimized from 2.6ns down to 0.9ns, a saving of 1.7ns, while an L2 cache miss might be 80ns. But 1.7ns is still about 2% of the 80ns, and it's about 70% of the 2.6ns. You don't want to start optimizing by reducing things that are 2% of your cost, but 2% isn't insignificant.
The bigger issue is that probably you don't need to do leap-year checks very often so probably your leap-year check isn't the place to focus unless it's, like, sending a SQL query across a data center or something.
__turbobrew__ · 6h ago
This is why linear arrays are the fastest datastructure unless proven otherwise.
GuB-42 · 7h ago
Integer division (and modulo) is not cheap on most CPUs. Along with memory access and branch prediction, it is something worth optimizing for.
And since you are talking about memory. Code also goes in memory. Shorter code is more cache friendly.
I don't see a use case where it matters for this particular application (it doesn't mean there isn't) but well targeted micro-optimizations absolutely matter.
adonovan · 6h ago
Quite right. If you use a compiled language (e.g. Go) the difference between the two implementations is indeed negligible.
They're not free at all. They're unlikely to create noticeable latency; however, CPU clock speeds and thus power consumption haven't been constant for years now. You are paying for that lack of optimization and mobile users would thank you to do it.
There is no silver bullet.
andrepd · 7h ago
> I tend to be of the opinion that for modern general purpose CPUs in this era, such micro-optimizations are totally unnecessary because modern CPUs are so fast that instructions are almost free.
What does this mean? Free? Optimisations are totally unnecessary because... instructions are free?
The implementation in TFA is probably on the order of 5x more efficient than a naive approach. This is time and energy as well. I don't understand what "free" means in this context.
Calendar operations are performed probably trillions of times every second across all types of computers. If you can make them more time- and energy-efficient, why wouldn't you?
If there's a problem with modern software it's too much bloat, not too much optimisation.
jdlshore · 7h ago
GP made an important point that you seemed to have missed: in modern architectures, it’s much more important to minimize memory access than to minimize instructions. They weren’t saying optimization isn’t important, they were describing how to optimize on modern systems.
drewg123 · 7h ago
If this is indeed done trillions of times a second, which I frankly have a hard time believing, then sure, it might be worth it. But on a modern CPU, focusing on an optimization like this is a poor use of developer resources. There are likely several other optimizations related to cache locality that you could find in less time than it would take to do this, and those other optimizations would probably give several orders of magnitude more improvement.
Not to mention that the final code is basically a giant WTF for anybody reading it. It will be an attractive nuisance that people will be drawn to, like moths to a flame, any time there is a bug around calendar operations.
wtetzner · 7h ago
> But on a modern CPU, focusing on an optimization like this is a poor use of developer resources.
How many people are rolling their own datetime code? This seems like a totally fine optimization to put into popular datetime libraries.
andrepd · 6h ago
> There are likely several other optimizations related to cache locality that you could find in less time than it would take to do this, and those other optimizations would probably give several orders of magnitude more improvement.
How is cache / memory access relevant in a subroutine that performs a check on a 16bit number?
> Not to mention that the final code is basically a giant WTF for anybody reading it. It will be an attractive nuisance that people will be drawn to, like moths to a flame, any time there is a bug around calendar operations.
1: comments are your friend
2: a unit test can assert that this function is equivalent to the naive one in about half a millisecond.
klysm · 7h ago
I think about branches a lot too when optimizing
croes · 7h ago
Related
> The world could run on older hardware if software optimization was a priority
I didn't find any way to get a compiler to generate a branchless version. I tried clang and GCC, both for amd64, with -O0, -O5, -Os, and for clang, -Oz.
mmozeiko · 6h ago
If you change logic and/or to bitwise and/or then it'll be branchless.
kragen · 6h ago
You commented out your entire function body and the closing }. Also, on 32-bit platforms, it doesn't stop working at 65535.
captaincrunch · 5h ago
just a formatting issue on my side, there were \n.
> How does this work? The answer is surprisingly complex.
I don't think anyone is surprised in the complexity of any explanation for that algorithm :D
The optimizations that compilers can achieve kind of amaze me.
Indeed, the latest version of cal from util-linux keeps it simple in the C source:
https://github.com/util-linux/util-linux/blob/v2.41/misc-uti...Does anyone have a collection of these things?
https://graphics.stanford.edu/~seander/bithacks.html
It is not on the list, but #define CMP(X, Y) (((X) > (Y)) - ((X) < (Y))) is an efficient way to do generic comparisons for things that want UNIX-style comparators. If you compare the output against 0 to check for some form of greater than, less than or equality, the compiler should automatically simplify it. For example, CMP(X, Y) > 0 is simplified to (X > Y) by a compiler.
The signum(x) function that is equivalent to CMP(X, 0) can be done in 3 or 4 instructions depending on your architecture without any comparison operations:
https://www.cs.cornell.edu/courses/cs6120/2022sp/blog/supero...
It is such a famous example, that compilers probably optimize CMP(X, 0) to that, but I have not checked. Coincidentally, the expansion of CMP(X, 0) is on the bit hacks list.
There are a few more superoptimized mathematical operations listed here:
https://www2.cs.arizona.edu/~collberg/Teaching/553/2011/Reso...
Note that the assembly code appears to be for the Motorola 68000 processor and it makes use of flags that are set in edge cases to work.
Finally, there is a list of helpful macros for bit operations that originated in OpenSolaris (as far as I know) here:
https://github.com/freebsd/freebsd-src/blob/master/sys/cddl/...
There used to be an Open Solaris blog post on them, but Oracle has taken it down.
Enjoy!
https://en.wikipedia.org/wiki/Hacker's_Delight
It was that generally the fast hardware multiplication operations in ALUs didn't have very many bits in the register word length, so multiplications of wider words had to be done with library functions that did long multiplication in (say) base 256.
So this code in the headlined article would not be "three instructions" but three calls to internal helper library functions used by the compiler for long-word multiplication, comparison, and bitwise AND; not markedly more optimal than three internal helper function calls for the three original modulo operations, and in fact less optimal than the bit-twiddled modulo-powers-of-2 version found halfway down the headlined article, which would only need check the least significant byte and not call library functions for two of the 32-bit modulo operations.
Bonus points to anyone who remembers the helper function names in Microsoft BASIC's runtime library straight off the top of xyr head. It is probably a good thing that I finally seem to have forgotten them. (-: They all began with "B$" as I recall.
https://github.com/ridiculousfish/libdivide
It's >3 machine instructions (and I admire the mathematical tricks included in the post), but it does do thousands of calculations fairly quickly :)
[1] https://calculang.dev/examples-viewer?id=leap-year
Warning: This may increase or decrease your popularity with fellow programmers, depending on how lucky you are in encountering problems where they make an important performance difference rather than a readability problem for people who have not deeply internalized bit twiddling.
Multiply and mask for varrious purposes is a thing I commonly use in my own code-- it's much more attractive now that it was decades ago because almost all computers we target these days have extremely fast multipliers.
These full-with logic operations and multipliers give you kind of a very parallel computer packed into a single instruction. The only problem is that it's a little tricky to program. :)
At least this one was easy to explain mechanically. Some bit hacks require p-adic numbers and other elements of number theory to explain.
The is no year 0, it goes 1 BC, 1 AD. So testing whether 0 is a leap year is moot.
Not true if you use astronomical year numbering: https://en.m.wikipedia.org/wiki/Astronomical_year_numbering
Which is arguably the right thing to do outside of specific domains (such as history) in which BCE is entrenched
If your software really has to display years in BCE, I think the cleanest way is store it as astronomical year numbering internally, then convert to CE/BCE on output
So what happens when it's 1582? (sorry, currently no time to articulate a good wiki fix)
However, in general, I think proleptic Gregorian is simpler. But in astronomy do what the astronomers do. And in history, dates between 1582 and 1923 (inclusive), you really need to explicitly mark the date as Gregorian or Julian, or have contextual information (such as the country) to determine which one to use.
1923 because that was when Greece switched from Julian to Gregorian, the last country to officially do so. Although various other countries in the Middle East and Asia adopted the Gregorian calendar more recently than 1923 - e.g. Saudi Arabia switched from the Islamic calendar to the Gregorian for most commercial purposes in 2016, and for most government purposes in 2023 - those later adoptions aren’t relevant to Julian-Gregorian cutover since they weren’t moving from Julian to Gregorian, they were moving from something non-Western to Gregorian
Large chunks of the Eastern Orthodox Church still use the Julian calendar for religious purposes; other parts theoretically use a calendar called “Revised Julian” which is identical to Gregorian until 2800 and different thereafter - although I wonder if humanity (and those churches) are still around in 2800, will they actually deviate from Gregorian at that point, or will they decide not to after all, or forget that they were officially supposed to
Without that design constraint, testing for leap years becomes locale-dependent and very complex indeed.
Everything before the introduction of the gregorian calendar is moot:
"In 1582, the pope suggested that the whole of Europe skip ten days to be in sync with the new calendar. Several religious European kingdoms obeyed and jumped from October 4 to October 15."
So you cannot use any date recorded before that time for calculations.
And before that it gets even more random:
"The priests’ observations of the lunar cycles were not accurate. They also deliberately avoided leap years over superstitions. Things got worse when they started receiving bribes to declare a year longer or shorter than necessary. Some years were so long that an extra month called Intercalaris or Mercedonius was added."
When the Julian calendar was really adopted I don't know. Certainly not 0001-01-01. And of course it varies by country like Gregorian.
At work we had discussions what date format to use in our product. It's for trained users only (but not IT people), English UI only, but used on several continents. Our regulatory expert propsed ISO8601. I did not agree, because that is not used anywhere in daily life except by 8 millions Swedes. I voted 15-Apr-2025 is much less prone to human error. (None of us "won". Different formats in different places still...)
And I call this thing “bit gymnastics”.
But do you know what's not free? Memory accesses[1]. So when I'm optimizing things, I focus on making things more cache friendly.
[1] http://gec.di.uminho.pt/discip/minf/ac0102/1000gap_proc-mem_...
https://cr.yp.to/talks/2015.04.16/slides-djb-20150416-a4.pdf (though see https://blog.regehr.org/archives/1515: "This piece (...) explains why Daniel J. Bernstein’s talk, The death of optimizing compilers (audio [http://cr.yp.to/talks/2015.04.16/audio.ogg]) is wrong", citing https://news.ycombinator.com/item?id=9397169)
https://blog.royalsloth.eu/posts/the-compiler-will-optimize-...
http://lua-users.org/lists/lua-l/2011-02/msg00742.html
https://web.archive.org/web/20150213004932/http://x264dev.mu...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110001
Binary searches on OpenZFS B-Tree nodes are faster in part because we did not wait for the compiler:
https://github.com/openzfs/zfs/commit/677c6f8457943fe5b56d7a...
Eliminating comparator function overhead via inlining is also a part of the improvement, which we would not have had because the OpenZFS code is not built with LTO, so even if the compiler fixes that bug, the patch will still have been useful.
You aren't going to beat the compiler if you have to consider a wide range of inputs and outputs but that isn't a typical setting and you can actually beat them. Even in general settings this can be true because it's still a really hard problem for the compiler to infer things you might know. That's why C++ has all those compiler hints and why people optimize with gcc flags other than -O.
It's often easy to beat Blas in matrix multiplication if you know some conditions on that matrix. Because Blas will check to find the best algo first but might not know (you could call directly of course and there you likely won't win, but you're competing against a person not the compiler).
Never over estimate the compiler. The work the PL people do is unquestionably useful but they'll also be the first to say you can beat it.
You should always do what Knuth suggested (the often misunderstood "premature optimization" quote) and get the profile.
They'll "optimize" your code by deleting it. They'll "prove" your null/overflow checks are useless and just delete them. Then they'll "prove" your entire function is useless or undefined and just "optimize" it to a no-op or something. Make enough things undefined and maybe they'll turn the main function into a no-op.
In languages like C, people are well advised to disable some problematic optimizations and explicitly force the compiler to assume some implementation details to make things sane.
For example:
It is safe to delete the second one. Even if it is not deleted, it will never be executed.What is problematic is when they remove something like memset() right before a free operation, when the memset() is needed to sanitize sensitive data like encryption keys. There are ways of forcing compilers to retain the memset(), such as using functions designed not to be optimized out, such as explicit_bzero(). You can see how we took care of this problem in OpenZFS here:
https://github.com/openzfs/zfs/pull/14544
Unfortunately, we cannot have nice things because of optimizing compilers and the holy C standard.
The compiler "knows" that signed integer overflow is undefined. In practice, it just assumes that integer overflow cannot ever happen and uses this "fact" to "optimize" this program. Since signed integers "cannot" overflow, it "proves" that the condition always evaluates to false. This leads it to conclude that both the condition and the consequent are dead code.
Then it just deletes the safety check and introduces potential security vulnerabilities into the software.
They had to add literal compiler builtins to let people detect overflow conditions and make the compiler actually generate the code they want it to generate.
Fighting the compiler's assumptions and axioms gets annoying at some point and people eventually discover the mercy of compiler flags such as -fwrapv and -fno-strict-aliasing. Anyone doing systems programming with strict aliasing enabled is probably doing it wrong. Can't even cast pointers without the compiler screwing things up.
No one does this
It's hard to imagine being in a situation where the cost of the additional "year divisible by 100" check, or even the "year divisible by 400" check is too much to bear, and it's trivial enough that the developer overhead is negligible, but you never know when you need those extra handful of bytes of memory I guess...
These things compound. You especially need to consider typical computer usage involves using more than one application at a time. There's a tragedy of the commons issue that's often ignored. It can be if you're optimizing your code (you're minimizing your share!) but it can't be if you're not.
I guarantee you we'd have a lot of faster things if people invested even a little time (these also compound :). Two great examples might be Llama.cpp and FlashAttention. Both of these have had a huge impact of people (among a number of other works) but don't get nearly the same attention as other stuff. These are popular instances but I promise you that there's a million problems like these waiting to be solved. It's just not flashy, but hey plumbers and garbagemen are pretty critical jobs too
I'm amazed by the fact there is always someone who will say that such optimization are totally unnecessary.
More broadly, it 100% depends on your workload. You'd be surprised at how many workloads are compute-bound, even today: LLM inference might be memory bound (thus how it's possible to get remotely good performance on a CPU), but training, esp. prefill, is very much not. And once you get out of LLMs, I'd say that most applications of ML tend to be compute bound.
The bigger issue is that probably you don't need to do leap-year checks very often so probably your leap-year check isn't the place to focus unless it's, like, sending a SQL query across a data center or something.
And since you are talking about memory. Code also goes in memory. Shorter code is more cache friendly.
I don't see a use case where it matters for this particular application (it doesn't mean there isn't) but well targeted micro-optimizations absolutely matter.
https://go.dev/play/p/i72xCyhqRkC
There is no silver bullet.
What does this mean? Free? Optimisations are totally unnecessary because... instructions are free?
The implementation in TFA is probably on the order of 5x more efficient than a naive approach. This is time and energy as well. I don't understand what "free" means in this context.
Calendar operations are performed probably trillions of times every second across all types of computers. If you can make them more time- and energy-efficient, why wouldn't you?
If there's a problem with modern software it's too much bloat, not too much optimisation.
Not to mention that the final code is basically a giant WTF for anybody reading it. It will be an attractive nuisance that people will be drawn to, like moths to a flame, any time there is a bug around calendar operations.
How many people are rolling their own datetime code? This seems like a totally fine optimization to put into popular datetime libraries.
How is cache / memory access relevant in a subroutine that performs a check on a 16bit number?
> Not to mention that the final code is basically a giant WTF for anybody reading it. It will be an attractive nuisance that people will be drawn to, like moths to a flame, any time there is a bug around calendar operations.
1: comments are your friend
2: a unit test can assert that this function is equivalent to the naive one in about half a millisecond.
> The world could run on older hardware if software optimization was a priority
https://news.ycombinator.com/item?id=43971464
bool is_leap_year(uint32_t y) { // Works for Gregorian years in range [0, 65535] return ((!(y & 3)) && ((y % 25 != 0) || !(y & 15))); }
I didn't find any way to get a compiler to generate a branchless version. I tried clang and GCC, both for amd64, with -O0, -O5, -Os, and for clang, -Oz.