For completeness, this description of alignment is misleading:
> Well, dear reader, this padding is added because the CPU needs memory to be aligned in sets of 4 bytes because it’s optimized in that fashion.
> ...
> Remember: since structs are aligned to 4 bytes, any padding is therefore unnecessary if the size of the struct is a multiple of 4 without the padding.
Individual data types have their own alignment (e.g., `bool`/`char` may be 1, `short` may be 2, `int` may be 4, `long` may be 8, etc.), and the alignment of a compound type (like a struct) defaults to the maximum alignment of its constituent types.
In this article, `struct Monster` has an alignment of 4 because `int` and `float` have an alignment of 4 for the author's configuration. Expanding one of the `int`s to a `long` could increase the alignment to 8 on some CPUs, and removing the `int` and `float` fields would decrease the alignment to 1 for most CPUs.
sumtechguy · 16h ago
Also keep in mind that this is all very CPU- and compiler-specific. I had one compiler that packed everything at 4/8, usually 8, not the 1/2/4/8 you would expect. That was because the CPU would just segfault if you didn't play nice with the data access. The compiler could hide a lot of it, if you set the packing, with offsets and memory moves and shifting. It was clever but slow. So by default they picked a wide enough packing that removed the extra instructions, at the cost of using more memory. x86 was by far the most forgiving at the time I was doing this; ARM was the least forgiving (at least on the platform I was using), with MIPS being OK in some cases but not others.
jandrewrogers · 8h ago
Some of the Cray hardware was basically pure 64-bit. The systems largely didn’t recognize smaller granularity. I learned a lot of lessons about writing portable C by writing code for Cray systems.
gavinsyancey · 14h ago
Um ... isn't alignment generally dictated by the platform ABI so that programs compiled by different compilers can be linked together?
plorkyeran · 14h ago
The widely used platforms with multiple compilers generally have one or more written down ABIs that the compilers all follow, but more niche platforms frequently have exactly one compiler (often a very out of date fork of gcc) that just does whatever they felt like implementing and may not even support linking together things built by different versions of that one compiler.
bobmcnamara · 9h ago
Ideally yes, but practically there are at least a dozen just for x86 (there are like 3 big ones).
boomlinde · 25m ago
> Individual data types have their own alignment (e.g., `bool`/`char` may be 1, `short` may be 2, `int` may be 4, `long` may be 8, etc.), and the alignment of a compound type (like a struct) defaults to the maximum alignment of its constituent types.
I will add that this is implementation-defined. IIRC the only restriction the standard imposes on the alignment of a struct is that a pointer to it, when converted, is also a pointer to its first member, meaning its alignment must in practice be a multiple of that of its first member.
NooneAtAll3 · 17m ago
implementation-defined means your specialized platform can be supported without needing to conform - it does not mean that common knowledge is false for common users
boomlinde · 3m ago
"Implementation-defined" means that there is nothing to conform to. I have not claimed that "common knowledge is false for common users" or anything to that effect. My comment is additive, which should have been clear to anyone reading the first three words of it.
flohofwoe · 5h ago
AFAIK alignment doesn't even matter anymore (for CPU data at least) since the 'word size' of a modern CPU is effectively the size of a cache line (32 or 64 bytes?), e.g. unaligned accesses within a 32- or 64-byte block are no different from aligned accesses.
(technically there is still an advantage of items aligned to their size in that such an item can never straddle adjacent cache lines though)
And there's also still tons of different alignment requirements when working with GPU data - and interestingly those alignment requirements may differ from C's alignment rules, so you may need to explicitly use packed structs (which are still not a standard C feature!) with manual padding.
birn559 · 4h ago
My understanding is that C++ compilers still add padding by default for performance reasons. CPU will have to spend a few cycles to reorganize data that is not aligned in chunks of 4 bytes.
xxs · 2h ago
> CPU will have to spend a few cycles to reorganize data that is not aligned in chunks of 4 bytes.
That's not true for quite a lot of CPUs. Pretty much all x64 and stuff don't care
flohofwoe · 4h ago
Daniel Lemire did some measuring ~~recently~~ (oops, in 2012):
TL;DR: 10% difference on what in 2012 was a low-end CPU, no difference on "new in 2012" CPUs. So my guess is that by now it really doesn't matter anymore :)
birn559 · 49s ago
Wasn't aware of that, thanks for the link!
kazinator · 10h ago
I.e. the author is wrong; struct s { char x; }; is not required to be four-byte-aligned. It can have a 1 byte size, and thus alignment.
bobmcnamara · 8h ago
And bool may be 4, and char may be 2 (m68k)!
flohofwoe · 5h ago
Not unless you're stuck in the 90s ;) These sizes have all been standardized to 1 byte since C99.
bobmcnamara · 2m ago
Not the sizes, the alignments.
david2ndaccount · 19h ago
The author relies on the compiler fitting the bitfield in unused padding bytes after “speed”, but that is implementation-defined behavior (almost everything about bitfields is implementation defined). MSVC will align the unsigned bitfield to alignof(unsigned), whereas GCC and clang will use the padding bytes. To portably pack the struct, use unsigned integers and use flags (monster.flags & CAN_FLY instead of monster.can_fly).
Bitfields have other problems. Say you have two bitfields each of `uint8_t` type and totaling 16 bits: well, you might think that's _two_ fields, but the compiler is allowed to treat them as _one_ whenever it wants to, and _that_ can be a problem if you're accessing one with atomics and the other without because the other can yield unsynchronized accesses to the one that needs synchronization.
Bitfields in C leave way too much behavior to the compiler or undefined. It's really intolerable.
atiedebee · 2h ago
Even worse: bit-fields can only be portably applied to int, signed int, and unsigned int (maybe char as well, but I don't remember).
Even crazier is the fact that a plain int bit-field's signedness is implementation-defined.
arjvik · 18h ago
Why can't there be a standard defined for bitfields in future C releases? This is a long-discussed drawback of a feature that I really really want to be able to use in production code.
petermclean · 15h ago
I've been working on an open source library that, while it doesn't solve bitfields, can provide convenient ergonomic access to bit-granular types:
You can inherit from bit_array and create a pseudo bitfield:
class mybits_t : bit::bit_array<17> {
public:
// slice a 1 bit sized field
// returns a proxy reference to the single bit
auto field1() {
return (*this)(0, 1);
}
// slice a 3 bit sized field
// returns a proxy reference to the 3 bits
auto field2() {
return (*this)(1, 4);
}
};
mybits_t mybits(7);
assert(mybits.field1() == 1);
mybits.field1() = 0;
assert(static_cast<int>(mybits) == 6); //no implicit cast to integers, but explicit supported
There is minimal overhead depending on how aggressively the compiler inlines or optimizes.
Only similar by a little bit. Bitset misses the mark in so many ways.
bit_array can be compile time storage (ala bitset) or dynamic at construction time. There is also bit_vector which is entirely dynamic
Critically though, is that the type/library is built on a proxy iterator/pointer that lets you do certain things with STL or the bit algorithms that you cannot otherwise do with bitset.
But back to the bitfields idea. The bit array backing storage is configurable through templates so you will be able to know for certain exactly how much stack or heap your type will take up
> An implementation may allocate any addressable storage unit large enough to hold a bit-field. If
> enough space remains, a bit-field that immediately follows another bit-field in a structure shall be
> packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that
> does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The
> order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is
> implementation-defined. The alignment of the addressable storage unit is unspecified.
In practice, I've been using this feature for ages (along with __attribute__((packed))). There comes a point where you can start depending on compiler specific features instead of relying solely on the standard.
Can this be standardized? It seems it would have to accommodate the most limited architecture, in terms of memory access, probably making it practically useless. Seems best left to the compiler.
wahern · 13h ago
The quoted text is from the C standard. The popular wisdom is that anything goes wrt bit-fields, but it's an exaggeration.
The original (?) DNS library used bit-fields to read and write the packet header status and flags fields, and that original code is still used today: https://github.com/openbsd/src/blob/master/include/arpa/name... It does rely on implementation-defined behavior--it swaps the order of fields based on whether the machine is little-, big-, or middle-endian--but even that behavior has remained remarkably consistent, at least on Unix ABIs.
I think the wisdom comes from dealing with niche compilers, especially from years ago, where bit-fields were one of the areas where standard adherence was less consistent; and in embedded and kernel contexts, where the implementation-defined and (implicit) undefined behavior, such as wrt atomicity or the intersection of bit-fields and (extension) packed pragmas, counseled against any usage of bit-fields.
ajross · 17h ago
There is, but it's part of the platform ABI and not the C language standard. The latter specifies syntax and behavior, the former is what's concerned with interoperability details like memory layout.
I happen to have a copy of AAPCS64 in front of me, and you can find the specification of bit packing in section 10.1.8. The sysv ABI for x86/x86_64 has its own wording (but I'm pretty sure is compatible). MSVC does something else I believe, etc...
munificent · 13h ago
This is a good article, but it bugs me that it sort of only tells half the story.
Why are you making your structs smaller? Probably for "performance". If your goal is literally just reducing memory usage, then this is all fine. Maybe you're running on a small microcontroller and you just straight up don't have much RAM.
But these days, in most areas, memory is cheap and plentiful. What's expensive is CPU cache. If you're optimizing memory use for that, it's really about runtime speed and less about total memory usage. You're trying to make things small so you have fewer cache misses and the CPU doesn't get stuck waiting as much.
In that case, the advice here is only part of the story. Using bitfields and smaller integers saves memory. But in order to access those fields and do anything with them, they need to get loaded into registers. I'm not an expert here, but my understanding is that loading a bitfield or small int into a word-sized register can sometimes have some bit of overhead, so you have to pay attention to the trade-off.
If I was optimizing for overall runtime speed and considering packing structs to do that, I'd really want to have good profiling and benchmarks in place first. Otherwise it's easy to think you're making progress when you're actually going backwards.
ack_complete · 10h ago
> If you're optimizing memory use for that, it's really about runtime speed and less about total memory usage. You're trying to make things small so you have fewer cache misses and the CPU doesn't get stuck waiting as much.
The complication is that cache is a global resource. Code that has larger data structures, even if it runs faster, can contribute to a larger working set and a slightly higher cache miss rate in other code. This can lead to a death-by-thousand cuts scenario and it's hard to get solid data on this when profiling.
You're right, though, that there are a number of ways that smaller structs can be slower, either directly by execution time or indirectly by larger code causing more misses in the instruction cache. Some other factors include whether the compiler can coalesce accesses to multiple fields grouped together, or whether the struct hits a magic power of two size for faster array indexing or vectorization. Which means you can end up speeding up code occasionally by making the struct bigger.
ljchen · 4h ago
I mostly agree, except that perf is determined by a myriad of factors. Even if this piece of data fits into a cache line, it may have no effect, or a negative one, on the following code. For complex software, even if it's faster after the change, it's hard to tell what the contributing factor really is. I once read a systems paper (forgot the name) from a recent systems research conference on how changing the code layout, e.g., randomly moving a block of code, can have unexpected impacts on perf.
Personally, I only pay attention to the cache line and layout when the structure is a concurrent data structure and needs to sync across multiple cores.
voidnap · 5h ago
> memory is cheap and plentiful. What's expensive is CPU cache.
CPU cache is memory. Also memory isn't cheap, it is relatively expensive to access. CPU cache is way cheaper. You have it backwards.
mort96 · 29m ago
The obvious and charitable read of their post is that they use "memory" to mean main memory/RAM, "CPU cache" to mean CPU cache, and "memory is cheap and plentiful" to mean "memory has a low dollar cost and most systems have a whole lot of it".
Your read -- that they use "memory" to mean "anything which stores bytes, including caches", and that they use "memory is cheap" to mean "accessing RAM has a low latency" -- is uncharitable, contrived and even literally self-contradictory ("CPU cache is memory" + "memory is expensive to access while CPU cache is cheap to access" is a contradiction which can only be resolved through using different definitions of "memory" in those two sentences).
writebetterc · 2h ago
They're talking about dollars
bluGill · 9h ago
I pack structs because the packed layout matches some hardware or network structure and thus I get direct access. This isn't a common case but it exists.
yellowapple · 6h ago
That case is much more common in embedded/bare-metal programming. A decent chunk of the code I've written so far for getting Zig running on a Nintendo 64 is a bunch of packed structs representing various registers (and a bunch of methods on those structs representing common uses of those registers).
It's also a reasonably common case for parsing binary data formats (e.g. a file's header has fields at exact byte offsets, and it's much faster and easier to just grab all the bytes of that header and interpret those bytes as a packed struct).
ryao · 12h ago
If you are using an intrusive data structure, keeping the key on the same cache line as the pointers can make traversals faster. Reducing padding can make this easier to do.
In general, keeping things that are commonly accessed together on the same cache line is a good idea, and the work needed to remove excessive padding overlaps with the work needed to do this.
zeusk · 13h ago
The real issue with packed structs and bitfields happens when concurrency gets involved. Majority of modern CPU caches are private and only allow one core to hold the cache line - so it creates more false dependencies when cores are trying to alter information that was compacted into a cache line on another core.
charleslmunger · 9h ago
Avoiding false sharing is a separate problem best solved by explicitly aligning the struct or relevant members to std::hardware_destructive_interference_size.
jcelerier · 11h ago
> But these days, in most areas, memory is cheap and plentiful
and that's why I am contemplating freaking 128G of RAM for my next machine; I have 64 right now and OOM regularly, 32 is unusable
zahlman · 9h ago
What software are you running? I'm humming along without a problem on 8.
jcelerier · 8h ago
just an instance of GCC on a larger file with some complex C++ will easily eat multiple gigabytes of RAM
wbl · 13h ago
Everything is microarch dependant but typically the memory subsystem can handle subword loads and writes. What it usually can't do is bits.
o11c · 15h ago
Note that although alignment and size are platform-specific, in practice padding can be minimized simply by using a fixed order:
(various vector types)
int128_t
long double
long long, int64_t, int_fast64_t, double
off_t
time_t (can be 64 bits on 32-bit platforms only if off_t also is)
register_t, int_fast16_t, int_fast32_t (64-bit on e.g. x32)
void *, void(*)(), intptr_t
ptrdiff_t (theoretically wrong if fully aligned and bigger than pointer)
size_t, ssize_t
long (32-bit on win64; generally has 16-bit alignment if bigger than pointer)
int32_t, float
int (can be 16-bit)
short, int16_t
char, bool, int8_t, int_fast8_t
(a trailing array member must come last; note when calculating combined size you should use offsetof instead of sizeof to not waste padding)
Besides this, `int_leastN_t` goes with `intN_t`, and all unsigned types go with their signed types. Note that types like `id_t` are hard to place however.
Although there do exist many platforms where this is not in order of size, the bigger types generally have smaller alignment on such platforms. Even so, I have occasionally split out rows without knowing of any platform that requires it.
kazinator · 16h ago
Generally speaking, sort the members in descending order by alignment, if you want the tightest packing without resorting to nonportable compiler extensions. This descending order eliminates padding between members producing the smallest offset for the last byte of the last member, and that then minimizes the padding required at the end of the struct.
In C, alignment for basic types is linked to size. A type may have an alignment requirement that is as much as its size: e.g. an 8 byte data type might be aligned as strictly as addresses divisible by 8.
The alignment of an array aggregate is that of its element type.
The alignment of a struct is that of its most strictly aligned member.
In portable coding, we don't exactly know, but we can make the worst-case assumption that every basic type has the strictest possible alignment---i.e. its size. We also don't know the sizes of all the types, but we know that integers of higher ranks are at least as large as lower ranks; i.e. sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long), etc.
Things are tricky when you have a mixture of floating-point, pointers and integers. E.g. pointer could be the same size as a float, or bigger.
Here is a trick you use: combine together clumps of like types, and try to make combinations that are powers of two. For instance, if your structure has two or four pointers to things, try to make them adjacent. E.g.:
struct s {
void *p1, *p2, *p3, *p4;
short s1, s2, s3, s4;
float f1, f2, f3, f4;
};
Even though float is almost certainly larger than short and may have stricter alignment, it's moot because I have four of everything. Even if this is on a tiny system where pointers are 16 bits wide, four of them will make 8 bytes. So when the f1 member is being placed into the struct, the offset is at least 8-byte aligned: we don't expect padding.
If you aggregate enough members of the same type in a power-of-two sized clump, you can easily create alignment which meets or exceeds the strictest alignment in the system.
So those are the main principles: group together members of the same type. Then keeping those groups together, sort by descending alignment strictness.
Recent versions of clangd show pahole-like struct size/alignment padding overhead annotations. I find this built-in support a lot more convenient than using pahole.
uticus · 20h ago
for somebody starting out, where do you go for a list of tools like this?
ryao · 12h ago
pahole is an awesome tool.
variadix · 18h ago
Some other things you can do to improve packing are using integer indices/offsets/relative pointers instead of absolute pointers (especially on 64-bit machines), overlapping fields for unions (as an example you may have a tag that can share some bits with another field in each member of the union), using a structs position in memory to encode information (for example, if you have a binary tree of array allocated nodes, you can specify that the right child always follows the parent in memory, then you only need to store the index of the left child), and of course structure packing pragmas/attributes.
foota · 16h ago
Some related thoughts (more for C++ than C).
You can eliminate the cost of a wrapping class with no members by taking advantage of the empty base class optimization (or rather: by avoiding members in a base class you will be taking advantage of EBCO).
You can use an untagged union in C++ (taking care to avoid undefined behavior) to implement small object optimization. In its general form, the idea is that you can store some smaller struct in a union (as a char[], to sidestep C++'s restrictions on active union members) alongside a larger structure, and use a sentinel in the larger structure to determine which it is. E.g., if the large struct has some field that must not be zero, then make it zero in the smaller struct and you can distinguish between the two without a tag (like would be required with std::variant).
I'm also a huge fan in general of "make invalid states unrepresentable" and following this can often reduce redundancy and eliminate extra fields.
bhawks · 5h ago
Optimizing struct layouts is cool, but if your motivation is to pack more monsters into your game, definitely consider transposing your row-oriented monster struct into a column-oriented entity using an Entity Component System.
It has benefits besides memory/cache optimizations. It organizes logic in a much more composable way that is friendly for reuse.
cogman10 · 17h ago
Nice little article.
The next thing to really explore is how the struct is being used. Smaller is always nice, but sometimes you need it to be fast. To do that, you'll want to group together fields that are commonly read. For example, x, y, and speed are likely always read together. So, it'd make sense to group those fields together to optimize for cache reads.
The next thing to question is if your Monster is being interacted with as an individual thing or as a group of things. If it's as a group of things then it might make sense to create a "Struct of arrays" of monsters. Imagine, for example, you have a system which removes all the dead monsters. It'd be faster to iterate through an `isAlive` array rather than an array of `Monster` structs.
And now you are on your way to reinventing an ECS :)
miguel_martin · 18h ago
>“… Note that I'm not an expert, in anything, so take everything I write about with a grain of salt I guess :).”
>”I’m by far an expert and I could be wrong in some specifications”
You’re missing a “not” on that 2nd sentence ;)
In all seriousness, you don’t need this disclaimer, you’re selling yourself short.
There’s nothing wrong with your post that warrants the disclaimer.
And if there is something wrong with the post, let the comments flood in and tell you that rather than yourself.
Retr0id · 18h ago
I don't think I've ever seen struct bitfields used in real code, it's almost always manual bit-masks against a numeric type, often via macros. (I assume due to portability)
ziml77 · 13h ago
The Windows API uses bitfields like that in some places. I know because I was working with the display configuration API recently and found the bitfields very annoying to use through generated FFI bindings (especially because they were used in anonymous structs inside anonymous unions).
gizmo686 · 17h ago
I see it decently often in embedded code for memory mapped IO registers. Other than that, I don't think I've ever seen it either.
I'm not sure portability is the reason though. I've seen it in places that have no problem spamming #define __GNU_SOURCE
mort96 · 22m ago
I would be scared of using it for IO... With a struct bit field, I don't know what bit offset a field ends up with, I don't know how it gets padded, and I don't know how large the generated load and store becomes. I'd rather do the bit operations and store myself.
nomel · 15h ago
This is the only place I've seen it used, and the only placed I've used it.
It's absolutely wonderful for IO.
dirt_like · 13h ago
I have seen bit fields used in real code. Long ago. The reason I know it so well is that bit field updates are NOT atomic, and this caused a rare crashing bug under heavy load. Unless you read the header the code looks atomic.
S->buffer_free = 1;
This is a very serious limitation if the structure/class is being updated in a multi-threaded environment.
tom_ · 14h ago
In terms of the underlying bit assignments, bitfields have been pretty safe for a very long time now, as you can assume little-endian. It's just a question of bearing in mind the packing rules, not too difficult given stdint.h. I went full bitfield shortly after I went full size_t, about 15 years ago now.
Big-endian systems sort-of still exist (I do indeed have an Amiga, and an Atari STe, and an Xbox 360, and they all work, so they must count) - but they're only slightly less weird from a practical C programming perspective than stuff like the DSPs with CHAR_BIT==24, and Harvard architecture microcontrollers with 2-byte ints. And so you can ignore them. People that still actually use these systems will be used to it.
ack_complete · 10h ago
There are some other annoyances, like not being able to inline initialize a bitfield prior to C++20, and sometimes having to use unnatural typing to get optimal packing depending on the ABI. But I've seen them used, compilers have gotten pretty good at optimizing them and can coalesce writes or tests against adjacent bitfields.
abnercoimbre · 18h ago
Yep. If I need a boolean for a struct I know I’ll want more than one. So I’ll use an integer type that lets me OR multiple enum values together (and so on).
anonnon · 7h ago
I have, e.g., in Julia's UTF-8 library (utf8proc).
herf · 14h ago
Sometimes struct-of-arrays (in-memory column store) is the right solution, and you get automatic packing that way. C makes it so much easier to store an array of structs that people don't do this often enough, but column stores prove it has its uses.
_mocha · 15h ago
I'm somewhat surprised to see this on the front page of HN. I distinctly remember this material being covered in a single lecture during our freshman CS course 20 years ago. I doubt this would've reached the front page in HN's early days, and it makes me wonder if we've moved away from teaching these fundamentals due to the heavy reliance on abstraction and ai agents in today's industry.
bigstrat2003 · 4h ago
I have a degree in CS from around the same time frame, and we never covered topics like this. So I think it varies widely, and is not necessarily a modern thing for people to not learn about struct sizing.
asa400 · 15h ago
I've never taken a lick of CS in a formal setting, and I feel like that's increasingly common as programming has broadened its employment base. Most of the people I work with haven't done formal CS, either. Consequently I've had to learn all of this stuff on my own. There are communities out there that value this kind of education. I can vouch for the Rust community, which has helped me learn _a ton_ about this kind of "lower level" stuff over the last 5 years despite starting my career doing Ruby.
jerf · 15h ago
One of the neat things about programming is that you can develop a high degree of skill in it without any formal education at all.
But it does mean that we've always had those programmers who have had characteristic holes in their education as a result.
Personally I enjoy having a mix of self-taught and formally-taught around; I think there are quite a few things the self-taught are often better at than the formally educated as a group. I respect both sides. I'm just saying that there has always been a market for this sort of thing on HN because the self-taught have always been around.
nomel · 15h ago
> I think there are quite a few things the self-taught are often better at than the formally educated as a group.
Hmm, the certificate for this site has a billion SANs but catb.org is not one of them...
cjensen · 16h ago
This is good stuff to have "in the toolbox," however habitually organizing structs this way is premature optimization. Much better to sensibly organize the fields and ignore alignment entirely unless you have a really good reason to do this. Good reasons could be that you know you will create a lot of these objects, or it has become a performance issue due to cache misses this would improve.
hubadu · 2h ago
I would suggest never using unsigned int for values that will be involved in any sort of calculation (health, damage, speed). It is considered bad practice due to overflow and should be avoided, with the exception of values used for UIDs, indexes, bitfields, masks, and flags.
ChrisRR · 2h ago
Nah, it's just whatever is most appropriate for your implementation. Overflow and wraparound are possible whether you use signed or unsigned.
As I say to some more junior devs: check whether the maths is correct and then do it; don't do it and then check whether it was wrong.
hubadu · 1h ago
My suggestion is straight from the C++ guidelines by Bjarne Stroustrup and Herb Sutter (ES.102: Use signed types for arithmetic), for reasons that you don't seem to grasp.
teo_zero · 14h ago
Neither TFA nor any comment mentions using a struct of arrays instead of an array of structs. This technique alone eliminates nearly all padding bytes!
zzo38computer · 17h ago
I would expect many C programmers would know these things, because I do like that when writing programs in C, too (although for bit fields, I will use a single field and then named bits by macros). There are some other considerations as well though, such as static vs dynamic allocation, and the requirements of how the specific program is working.
sim7c00 · 21h ago
definitely interesting points for C structs. Regarding the CPU, it might be interesting to look a little further still at optimal struct sizes with respect to caching: figuring out hot and cold areas and either including padding or putting cold values behind a pointer to a substruct.
like the examples though, it's very clear what the changes are and their specific impact. Always a good reminder to review for these things after a session of hacking in new stuff!
TinkersW · 16h ago
Could improve it by using enum MonsterName : u8, moving it down near the other byte-sized item, and specifying the exact type of the bitfield (u8). Then it would be 16 bytes.
fattah25 · 11h ago
Reminds me of Andrew Kelley's conference talk "data-oriented design", where he explained this topic.
uticus · 20h ago
for another good intro to the fun with structs, Beej's guide is always a good goto:
...goes into offsets, padding bytes, bit-field, unions, etc etc.
degenoah · 20h ago
I feel like this page would have definitely helped me when I used to try reverse engineering things I had lying around on my computer.
jesse__ · 17h ago
Minor nitpick, 32bit max value is ~4 billion, not 4 million
bullen · 18h ago
Adding to this, you want the struct to be exactly 64 bytes, even on Apple M-series processors, which have 128-byte cache lines.
warmwaffles · 20h ago
This is also applicable outside of C as well.
mcraiha · 20h ago
"The x_position and y_position can unfortunately not get a smaller type nor can a float be unsigned."
If you have the latest C++, you can use float16_t or bfloat16_t to get a smaller type
https://en.cppreference.com/w/cpp/types/floating-point.html
david2ndaccount · 19h ago
C23 has _Float16
andrewmcwatters · 17h ago
The article casually ignores the fact that more production code uses bitmasking than bit fields.
1718627440 · 2h ago
Why, though?
jakey_bakey · 18h ago
Loved this, thanks!
4gotunameagain · 16h ago
Unpacked structs are also very useful for parsing network data. You take a package, cast it to an unpacked struct, and can access everything nicely.
Many a spacecraft telemetry stack works like that.
rkangel · 2h ago
I think you mean "packed" structs. Otherwise you've got padding in your communication protocol.
If you do this you also need to be careful of byte order.
JonChesterfield · 9h ago
The alias analysis pass in your C or C++ compiler will savage you for doing that unless you're careful with annotations or compiler flags though.
westurner · 18h ago
Is it possible to apply these optimizations to Arrow?
Arrow's struct is called StructArray. Fields of StructArray have a StructType.
> Well, dear reader, this padding is added because the CPU needs memory to be aligned in sets of 4 bytes because it’s optimized in that fashion.
> ...
> Remember: since structs are aligned to 4 bytes, any padding is therefore unnecessary if the size of the struct is a multiple of 4 without the padding.
Individual data types have their own alignment (e.g., `bool`/`char` may be 1, `short` may be 2, `int` may be 4, `long` may be 8, etc.), and the alignment of a compound type (like a struct) defaults to the maximum alignment of its constituent types.
In this article, `struct Monster` has an alignment of 4 because `int` and `float` have an alignment of 4 for the author's configuration. Expanding one of the `int`s to a `long` could increase the alignment to 8 on some CPUs, and removing the `int` and `float` fields would decrease the alignment to 1 for most CPUs.
I will add that this is implementation defined. IIRC the only restriction the standard imposes on the alignment of a struct is that a pointer to it is also a pointer to its first member when converted, meaning its alignment must practically be a multiple of that of its first field.
(technically there is still an advantage of items aligned to their size in that such an item can never straddle adjacent cache lines though)
And there's also still tons of different alignment requirements when working with GPU data - and interestingly those alignment requirements may differ from C's alignment rules, so you may need to explicitly use packed structs (which are still not a standard C feature!) with manual padding.
That's not true for quite a lot of CPUs. Pretty much all x64 and stuff don't care
https://lemire.me/blog/2012/05/31/data-alignment-for-speed-m...
TL;DR: 10% difference on what in 2012 was a low-end CPU, no difference on "new in 2012" CPUs. So my guess is that by now it really doesn't matter anymore :)
See https://c.godbolt.org/z/K7oY77KGj
Bitfields in C leave way too much behavior to the compiler or undefined. It's really intolerable.
Even crazier is the fact that an int bitfield's signedness is implementation defined
https://github.com/PeterCDMcLean/BitLib
You can inherit from bit_array and create a pseudo bitfield:
There is minimal overhead depending on how aggressively the compiler inlines or optimizes. bit_array can be compile-time storage (a la bitset) or dynamic at construction time. There is also bit_vector which is entirely dynamic.
Critically though, is that the type/library is built on a proxy iterator/pointer that lets you do certain things with STL or the bit algorithms that you cannot otherwise do with bitset.
But back to the bitfields idea. The bit array backing storage is configurable through templates so you will be able to know for certain exactly how much stack or heap your type will take up
> An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined. The alignment of the addressable storage unit is unspecified.
In practice, I've been using this feature for ages (along with __attribute__((packed))). There comes a point where you can start depending on compiler specific features instead of relying solely on the standard.
The original (?) DNS library used bit-fields to read and write the packet header status and flags fields, and that original code is still used today: https://github.com/openbsd/src/blob/master/include/arpa/name... It does rely on implementation-defined behavior--it swaps the order of fields based on whether the machine is little-, big-, or middle-endian--but even that behavior has remained remarkably consistent, at least on Unix ABIs.
I think the wisdom comes from dealing with niche compilers, especially from years ago, where bit-fields were one of the areas where standard adherence was less consistent; and in embedded and kernel contexts, where the implementation-defined and (implicit) undefined behavior, such as wrt atomicity or the intersection of bit-fields and (extension) packed pragmas, counseled against any usage of bit-fields.
I happen to have a copy of AAPCS64 in front of me, and you can find the specification of bit packing in section 10.1.8. The sysv ABI for x86/x86_64 has its own wording (but I'm pretty sure is compatible). MSVC does something else I believe, etc...
Why are you making your structs smaller? Probably for "performance". If your goal is literally just reducing memory usage, then this is all fine. Maybe you're running on a small microcontroller and you just straight up don't have much RAM.
But these days, in most areas, memory is cheap and plentiful. What's expensive is CPU cache. If you're optimizing memory use for that, it's really about runtime speed and less about total memory usage. You're trying to make things small so you have fewer cache misses and the CPU doesn't get stuck waiting as much.
In that case, the advice here is only part of the story. Using bitfields and smaller integers saves memory. But in order to access those fields and do anything with them, they need to get loaded into registers. I'm not an expert here, but my understanding is that loading a bitfield or small int into a word-sized register can sometimes have some bit of overhead, so you have to pay attention to the trade-off.
If I was optimizing for overall runtime speed and considering packing structs to do that, I'd really want to have good profiling and benchmarks in place first. Otherwise it's easy to think you're making progress when you're actually going backwards.
The complication is that cache is a global resource. Code that has larger data structures, even if it runs faster, can contribute to a larger working set and a slightly higher cache miss rate in other code. This can lead to a death-by-thousand cuts scenario and it's hard to get solid data on this when profiling.
You're right, though, that there are a number of ways that smaller structs can be slower, either directly by execution time or indirectly by larger code causing more misses in the instruction cache. Some other factors include whether the compiler can coalesce accesses to multiple fields grouped together, or whether the struct hits a magic power of two size for faster array indexing or vectorization. Which means you can end up speeding up code occasionally by making the struct bigger.
Personally, I only pay attention to the cache line and layout when the structure is a concurrent data structure and needs to sync across multiple cores.
CPU cache is memory. Also memory isn't cheap, it is relatively expensive to access. CPU cache is way cheaper. You have it backwards.
Your read -- that they use "memory" to mean "anything which stores bytes, including caches", and that they use "memory is cheap" to mean "accessing RAM has a low latency" -- is uncharitable, contrived and even literally self-contradictory ("CPU cache is memory" + "memory is expensive to access while CPU cache is cheap to access" is a contradiction which can only be resolved through using different definitions of "memory" in those two sentences).
It's also a reasonably common case for parsing binary data formats (e.g. a file's header has fields at exact byte offsets, and it's much faster and easier to just grab all the bytes of that header and interpret those bytes as a packed struct).
In general, keeping commonly accessed together things on the same cache line is a good idea and the work needed to remove excessive padding overlaps with the work needed to do this.
and that's why I am contemplating freaking 128G of ram for my next machine, I have 64 right now and OOM regularly, 32 is unuseable
Although there do exist many platforms where this is not in order of size, the bigger types generally have smaller alignment on such platforms. Even so, I have occasionally split out rows, despite not knowing of any such platform.
In C, alignment for basic types is linked to size. A type may have an alignment requirement that is as much as its size: e.g. an 8 byte data type might be aligned as strictly as addresses divisible by 8.
The alignment of an array aggregate is that of its element type.
The alignment of a struct is that of its most strictly aligned member.
In portable coding, we don't exactly know, but we can make the worst-case assumption that every basic type has the strictest possible alignment---i.e. its size. We also don't know the sizes of all the types, but we know that integers of higher ranks are at least as large as lower ranks; i.e. sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long), etc.
Things are tricky when you have a mixture of floating-point, pointers and integers. E.g. pointer could be the same size as a float, or bigger.
Here is a trick you use: combine together clumps of like types, and try to make combinations that are powers of two. For instance, if your structure has two or four pointers to things, try to make them adjacent. E.g.:
Even though float is almost certainly larger than short and may have stricter alignment, it's moot because I have four of everything. Even if this is on a tiny system where pointers are 16 bits wide, four of them will make 8 bytes. So when the f1 member is being placed into the struct, the offset is at least 8 byte aligned: we don't expect padding. If you aggregate enough members of the same type in a power-of-two sized clump, you can easily create alignment which meets or exceeds the strictest alignment in the system.
So those are the main principles: group together members of the same type. Then keeping those groups together, sort by descending alignment strictness.
You can eliminate the cost of a wrapping class with no members by taking advantage of the empty base class optimization (or rather: by avoiding members in a base class you will be taking advantage of EBCO).
You can use an untagged union in C++ (taking care to avoid undefined behavior) to implement small object optimization. In its general form, the idea is you can store some smaller struct in a union (as a char[], since C++ unions can't directly hold non-trivial types) with a larger structure, and use a sentinel in the larger structure to determine which it is. E.g., if the large struct has some field that must not be zero, then make it zero in the smaller struct and you can distinguish between the two without a tag (like would be required with std::variant).
I'm also a huge fan in general of "make invalid states unrepresentable" and following this can often reduce redundancy and eliminate extra fields.
Example framework:
https://github.com/skypjack/entt?tab=readme-ov-file
It has benefits besides memory/cache optimizations. It organizes logic in a much more composable way that is friendly for reuse.
The next thing to really explore is how the struct is being used. Smaller is always nice, but sometimes you need it to be fast. To do that, you'll want to group together fields that are commonly read. For example, x, y, and speed are likely always read together. So, it'd make sense to group those fields together to optimize for cache reads.
The next thing to question is if your Monster is being interacted with as an individual thing or as a group of things. If it's as a group of things then it might make sense to create a "Struct of arrays" of monsters. Imagine, for example, you have a system which removes all the dead monsters. It'd be faster to iterate through an `isAlive` array rather than an array of `Monster` structs.
And now you are on your way to reinventing an ECS :)
>”I’m by far an expert and I could be wrong in some specifications”
You’re missing a “not” on that 2nd sentence ;)
In all seriousness, you don’t need this disclaimer, you’re selling yourself short.
There’s nothing wrong with your post to make the disclaimer.
And if there is something wrong with the post, let the comments flood in and tell you that rather than yourself.
I'm not sure portability is the reason though. I've seen it in places that have no problem spamming #define _GNU_SOURCE
It's absolutely wonderful for IO.
S->buffer_free = 1;
This is a very serious limitation if the structure/class is being updated in a multi-threaded environment.
Big-endian systems sort-of still exist (I do indeed have an Amiga, and an Atari STe, and an Xbox 360, and they all work, so they must count) - but they're only slightly less weird from a practical C programming perspective than stuff like the DSPs with CHAR_BIT==24, and Harvard architecture microcontrollers with 2-byte ints. And so you can ignore them. People that still actually use these systems will be used to it.
But it does mean that we've always had those programmers who have had characteristic holes in their education as a result.
Personally I enjoy having a mix of self-taught and formally-taught around; I think there are quite a few things the self-taught are often better at than the formally educated as a group. I respect both sides. I'm just saying that there has always been a market for this sort of thing on HN because the self-taught have always been around.
Would you mind expanding on this a bit?
No comments yet
And there is a related talk by Andrew Kelley implementing these ideas on Zig toolchain: https://www.youtube.com/watch?v=IroPQ150F6c
As I say to some more junior devs, check whether the maths is correct and then do it, don't do it and then check if it was wrong
StructType has .bit_width and .byte_width attrs in Python and probably the other implementations too: https://arrow.apache.org/docs/python/generated/pyarrow.Struc...
Arrow supports bitfields with BooleanArray, and enums with categoricals but
"BUG: Categorical columns using the PyArrow backend requires 4x more memory" (open) https://github.com/pandas-dev/pandas/issues/58062 :
> On disk Parquet appears to store the category data as logical type String which is compressed with snappy and encoded
Arrow Flight RPC handles nested structs with enums over the wire somehow too FWIU