Everything I know about good system design

59 dondraper36 31 8/16/2025, 7:38:46 AM seangoedecke.com ↗


KronisLV · 6m ago
> Schema design should be flexible, because once you have thousands or millions of records, it can be an enormous pain to change the schema. However, if you make it too flexible (e.g. by sticking everything in a “value” JSON column, or using “keys” and “values” tables to track arbitrary data) you load a ton of complexity into the application code (and likely buy some very awkward performance constraints). Drawing the line here is a judgment call and depends on specifics, but in general I aim to have my tables be human-readable: you should be able to go through the database schema and get a rough idea of what the application is storing and why.

I’m surprised that the drawbacks of EAV or just using JSON in your relational database don’t get called out more.

I’d much rather have, say, 20 tables with a clear purpose than see that colleagues have once more created a “classifier” mechanism and are using polymorphic links (without actual foreign keys; columns like “section” and “entity_id”) and are treating it as a grab bag of stuff. One that you also need to read a bunch of application code to even hope to understand.
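A minimal sketch of the difference (using SQLite and made-up table names): reassembling one logical record from an EAV-style grab bag takes a pivot over N rows, while an explicit table is one self-documenting query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The grab bag: one generic table keyed by "section" and "entity_id".
conn.execute(
    "CREATE TABLE attributes (section TEXT, entity_id INTEGER, key TEXT, value TEXT)")
conn.executemany("INSERT INTO attributes VALUES (?, ?, ?, ?)", [
    ("user", 1, "name", "Ada"),
    ("user", 1, "email", "ada@example.com"),
])

# Reassembling one logical record means pivoting its rows back into columns.
eav_row = conn.execute("""
    SELECT MAX(CASE WHEN key = 'name'  THEN value END),
           MAX(CASE WHEN key = 'email' THEN value END)
    FROM attributes
    WHERE section = 'user' AND entity_id = 1
""").fetchone()

# The explicit table: the schema itself documents what is stored and why.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'ada@example.com')")
plain_row = conn.execute("SELECT name, email FROM users WHERE id = 1").fetchone()

assert eav_row == plain_row == ("Ada", "ada@example.com")
```

Same data both ways, but the EAV query grows another `MAX(CASE ...)` branch for every attribute you add.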

Whenever I see that, I want to change careers. I get that EAV has its use cases, but in most other cases fuck EAV.

It’s right up there with N+1 issues; complex dynamically generated SQL when views would suffice; storing audit data in the same DB, where functionality inevitably gets written against it and your audit data becomes part of the business logic; shared database instances without the ability to easily bootstrap your own; working with Oracle in general; and putting things in the DB that’d be better off in the app, and vice versa.

There are so many ways to decrease your quality of life when it comes to storing and accessing data.

dondraper36 · 46s ago
There's a great book, SQL Antipatterns by Bill Karwin, where this specific antipattern is discussed and criticized.

That said, sometimes when I realize there's no way for me to come up with even a rough schema (say, some settings object that is returned to the frontend), I use JSONB columns in Postgres. As a rule of thumb, however, if something can be normalized, it should be; after all, it's still a relational database despite all the JSON(B) conveniences and optimizations in Postgres.
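A sketch of that rule of thumb (shown with SQLite's JSON1 functions so it runs anywhere; in Postgres you'd use a JSONB column and the `->>` / `json_extract`-style operators, and the table and column names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schemaless settings blob goes in a JSON column; everything with a
# stable shape stays a normal, typed, queryable column.
conn.execute(
    "CREATE TABLE user_settings (user_id INTEGER PRIMARY KEY, settings TEXT)")
conn.execute(
    """INSERT INTO user_settings VALUES (1, '{"theme": "dark", "pagination": 50}')""")

theme = conn.execute(
    "SELECT json_extract(settings, '$.theme') FROM user_settings WHERE user_id = 1"
).fetchone()[0]
assert theme == "dark"
```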

bambax · 50m ago
> When querying the database, query the database. It’s almost always more efficient to get the database to do the work than to do it yourself. For instance, if you need data from multiple tables, JOIN them instead of making separate queries and stitching them together in-memory.

Oh yes! Never do a join in the application code! But also: use views! (and stored procedures if you can). A view is an abstraction over the underlying data; it's functional by nature, unlikely to break for random reasons in the future, and if done well the underlying SQL is surprisingly readable and easy to reason about.
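A small sketch of the idea (SQLite, with invented table names): the view packages the join once, and application code queries it like a plain, readable table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);

-- The view is the abstraction: callers never see the join.
CREATE VIEW order_summary AS
    SELECT o.id AS order_id, c.name AS customer, o.total
    FROM orders o JOIN customers c ON c.id = o.customer_id;

INSERT INTO customers VALUES (1, 'Ada');
INSERT INTO orders VALUES (10, 1, 42.0);
""")

rows = conn.execute("SELECT customer, total FROM order_summary").fetchall()
assert rows == [("Ada", 42.0)]
```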

quietbritishjim · 9m ago
I think it's OK to have this rule as a first approximation, but like all design rules you should understand it well enough to know when to break it.

I worked on an application which joined across lots of tables, which made a few dozen records balloon to many thousands of result rows, with huge redundancy in the results. Think of something like a single conceptual result having details A, B, C from one table, X, Y from another table, and 1, 2, 3 from another table. Instead of having 8 result rows (or 9 if you include the top level one from the main table) you have 18 (AX1, AX2, AX3, AY1, ...). It gets exponentially worse with more tables.

We moved to separate queries for the different tables. Importantly, we were able to filter them all on the same condition, so we were not making multiple queries to child tables when there were lots of top-level results.

The result was much faster because the extra network overhead was overshadowed by the saving in query processing and quantity of data returned. And the application code was actually simpler, because it was a pain to pick out unique child results from the big JOIN. It was literally a win in every respect with no downsides.
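The blow-up is easy to reproduce (a sketch in SQLite with made-up table names): three child tables with 3, 2, and 3 rows produce 3 x 2 x 3 = 18 joined rows, while per-table queries on the same condition return only 3 + 2 + 3 = 8.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main (id INTEGER PRIMARY KEY);
CREATE TABLE details (main_id INTEGER, v TEXT);  -- A, B, C
CREATE TABLE flags   (main_id INTEGER, v TEXT);  -- X, Y
CREATE TABLE notes   (main_id INTEGER, v TEXT);  -- 1, 2, 3
INSERT INTO main VALUES (1);
INSERT INTO details VALUES (1, 'A'), (1, 'B'), (1, 'C');
INSERT INTO flags   VALUES (1, 'X'), (1, 'Y');
INSERT INTO notes   VALUES (1, '1'), (1, '2'), (1, '3');
""")

# One big join: every combination of child rows comes back.
joined = conn.execute("""
    SELECT d.v, f.v, n.v
    FROM main m
    JOIN details d ON d.main_id = m.id
    JOIN flags   f ON f.main_id = m.id
    JOIN notes   n ON n.main_id = m.id
""").fetchall()
assert len(joined) == 18  # 3 * 2 * 3, mostly redundant data

# Separate queries, all filtered on the same condition: 8 rows total.
separate = sum(
    len(conn.execute(f"SELECT v FROM {t} WHERE main_id = 1").fetchall())
    for t in ("details", "flags", "notes"))
assert separate == 8
```

With more sibling child tables the joined row count is the product of their sizes, which is why a few dozen records can balloon into thousands.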

(Later, we just stuffed all the data into a single JSONB in a single table, which was even better. But even that is an example of breaking the old normalisation rule.)

9rx · 4m ago
> which made a few dozen records balloon to many thousands of result rows

That doesn't really sound like a place where the data is actually conceptually joined. I suspect, as it is something commonly attempted, that you were abusing joins to work around the N+1 problem. As a corollary to the above: you also shouldn't de-join in application code.

bob1029 · 9m ago
This is a big part of what makes ORMs a problem.

Writing raw SQL views/queries per MVC view in SSR arrangements is one of the most elegant and performant ways to build complex web products. Let the RDBMS do the heavy lifting with the data. There are optimizations in play you can't even recall (because there are so many) if you're using something old and enterprisey like MSSQL or Oracle. The web server should be able to directly interpolate SQL result sets into corresponding <table>s, etc., without having to round-trip for each row or perform additional in-memory join operations.
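That interpolation step might look like this (a Python sketch with invented names; the point is one query, one pass over the result set, no per-row round trips):

```python
import sqlite3
from html import escape

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("widget", 9.99), ("gadget", 19.99)])

# One raw query per MVC view; the result set maps directly onto a <table>.
cur = conn.execute("SELECT name, price FROM products ORDER BY name")
body = "".join(
    "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
    for row in cur)
html = f"<table>{body}</table>"
assert html.count("<tr>") == 2
```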

The typical ORM implementation is the exact opposite of this - one strict object model that must be used everywhere. It's about as inflexible as you can get.

tossandthrow · 15m ago
Views make good sense when you can check them in - and DB migrations are a poor way of doing it due to their immutable nature.

Depending on the ecosystem the code base adopts a good orm might be a better choice to do joins.

CafeRacer · 11m ago
I came here to say exactly the opposite. There were a few instances where a relatively heavy join would not perform well, no matter what I tried, and it was faster to load and stitch the data together with goroutines. So I just opted to do it that way.

Also SQL is easy, but figuring out what's up with indexes and planner is not.

tetha · 9m ago
The distinction of stateful and stateless is one of the main criteria how we're dividing responsibilities between platform-infra and development.

I know it's a bit untrue, but you can't do that many things wrong with a stateless application running in a container. And often the answer is "kill it and deploy it again". As long as you don't shred your dataset with a bad migration or some bad database code, most bad things at this level can be fixed in a few minutes with a few redeployments.

I'm fine having a larger number of people, with varying degrees of experience, time, care, and diligence, working here.

With a persistence layer like a database or a file store, you need some degree of experience with what you have to do around the system so it doesn't become a business risk. Put plainly, a database can be a massive business risk even if it is working perfectly... because no one set up backups.

That's why our storage systems are run by dedicated people who have been doing this for years and years. A bad database loss easily sinks ships.

ZYbCRq22HbJ2y7 · 50m ago
> You’re supposed to store timestamps instead, and treat the presence of a timestamp as true. I do this sometimes but not always - in my view there’s some value in keeping a database schema immediately-readable.

Seems overly negative about broad advice on a good pattern?

    is_on => true
    on_at => 1023030
Sure, that makes sense.

    is_a_bear => true
    a_bear_at => 12312231231
Not so much, as most bears do not become bears at some point after not being a bear.
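
For the cases where the pattern does fit, the quoted advice amounts to one nullable timestamp column doing double duty (a SQLite sketch; the `tasks` table and `completed_at` column are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    completed_at TEXT  -- NULL = not done; a value = done, and when
)""")
conn.executemany("INSERT INTO tasks VALUES (?, ?)",
                 [(1, "2025-08-16T07:38:46Z"), (2, None)])

# "Presence of a timestamp" is the boolean.
done = conn.execute(
    "SELECT id FROM tasks WHERE completed_at IS NOT NULL").fetchall()
assert done == [(1,)]
```
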
grey-area · 1m ago
I’d see the booleans as a bad thing in almost all cases, instead of a boolean you can have a timestamp or an integer field (which can expand later).

In the is_a case almost always a type or kind is better as you’ll rarely just have bears even if you only start with bears, just as you rarely have just two states for a status field (say on or off), often these expand in use to include things like suspended, deleted and asleep.

So generally I’d avoid booleans, as they tend to multiply and increase complexity, particularly when they cover mutually exclusive states like live, deleted, and suspended. I have seen is_visible, is_deleted, and is_suspended all on the same table (without a status), and the resulting code and queries are not pretty.

I’d use an integer rather than a timestamp to replace them though.
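A sketch of that replacement (SQLite; the `accounts` table and the status codes are invented): one constrained integer status column instead of three mutually exclusive booleans, with room to grow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Status codes (invented): 0=live, 1=suspended, 2=deleted, 3=asleep.
conn.execute("""CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    status INTEGER NOT NULL DEFAULT 0 CHECK (status IN (0, 1, 2, 3))
)""")
conn.execute("INSERT INTO accounts (id, status) VALUES (1, 0), (2, 2)")

live = conn.execute("SELECT id FROM accounts WHERE status = 0").fetchall()
assert live == [(1,)]

# The constraint rejects contradictory states that three independent
# boolean columns could never rule out.
try:
    conn.execute("INSERT INTO accounts (id, status) VALUES (3, 9)")
    raise AssertionError("CHECK constraint should have fired")
except sqlite3.IntegrityError:
    pass
```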

setr · 15m ago
If you take the statement at face value - that storing booleans in the DB at all is a bad smell - then he's correct.

Although I’m not even sure it’s broadly a good principle, even in the on_at case: if you actually care about this kind of thing, you should be storing it properly in some kind of audit table. Switching a bool to a timestamp is more of a weird lazy hack that probably won’t be all that useful in practice, because only a random subset of the data gets tracked that way (a boolean data type definitely isn’t the deciding factor on whether something is important enough to track update time on).

I’ve got the same suspicion about soft deletes — I’m fairly positive they’re useless in practice, just a mentally lazy substitute for proper auditing. You definitely can’t just undelete, and it doesn’t capture update history, so all you’re really protecting against is an accidental bulk delete caught immediately? Which is half the point of your backup.

moebrowne · 4m ago
It's well documented that soft delete is more of a headache than it's worth

https://brandur.org/soft-deletion

maxbond · 8m ago
Audit tables are a big ask both in terms of programming effort to design and support them, and in terms of performance hit due to write amplification (all inserts and updates cause an additional write to an audit table). Whereas making a bool into a timestamp is free. Including timestamps on rows (including created_at and updated_at) are real bacon savers when you've deployed a bug and corrupted some rows and need to eg refund orders created in a certain window.
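
The "refund orders created in a certain window" query is the payoff (SQLite sketch; the table, amounts, and timestamps are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    amount REAL,
    created_at TEXT
)""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2025-08-15T09:00:00Z"),
    (2, 20.0, "2025-08-15T10:30:00Z"),  # created while the bad deploy was live
    (3, 30.0, "2025-08-15T12:00:00Z"),
])

# ISO-8601 strings sort lexicographically, so BETWEEN works directly.
affected = conn.execute("""
    SELECT id FROM orders
    WHERE created_at BETWEEN '2025-08-15T10:00:00Z' AND '2025-08-15T11:00:00Z'
""").fetchall()
assert affected == [(2,)]
```
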
spiddy · 4m ago
though why treat booleans as special case and keep timestamps for them when you don’t for integers with this pattern:

    isDarkTheme: {timestamped}
    paginationItems: 50

I can see when dark theme was activated but not when pagination was set to 50.

also, i can’t see when dark theme is being deactivated either.

Seems like a poor man's changelog. There may be use cases for it, but I can't think of any, tbh.

Lionga · 2m ago
All this general advice is quite useless and needs millions of asterisks.

Good system design is designing a system that works best for the problem at hand.

seafoamteal · 15m ago
I think in that situation, you could have an enum value that contains Bear and whatever other categories you are looking at.
ZYbCRq22HbJ2y7 · 3m ago
Sure, but this was for demonstration purposes: it shows that some data carries meaning that doesn't have a time-dependent instantiation state separate from the creation of the entire record.

com · 1h ago
The advice about logging and metrics was good.

I had been nodding away about state and push/pull, but this section grabbed my attention, since I’ve never seen it so clearly articulated before.

bravesoul2 · 1h ago
Yes. Everyone should spend the small amount of time getting some logging/metrics going. It's like tests: getting from 0 to 1 test is psychologically hard in an org, but going from 1 to 1000 then becomes "how did I live without this". Grafana has a decent free tier, or you can self-host.
dondraper36 · 1h ago
The logging part is spot on. It has happened so many times that I thought, "Oh, I wish I had logged this," and then faced an issue or even an incident and introduced those logs anyway.
bravesoul2 · 1h ago
It is a balance. Too many logs cost money and slow down log searches both for the search and the human seeing 100 things on the same trace.
jillesvangurp · 51m ago
The trick here is to log aggressively and then filter aggressively. Logs only get costly if you keep them endlessly. Receiving them isn't that expensive. And keeping them for a short while won't break the bank either. But having logs pile up by the tens of GB every day gets costly pretty quickly. Having aggressive filtering means you don't have that problem. And when you need the logs, temporarily changing the filters is a lot easier than adding a lot of ad hoc logging back into the system and deploying that.

Same with metrics. Mostly they don't matter. But when they do, it's nice if it's there.

Basically, logging is the easy and cheap part of observability, it's the ability to filter and search that makes it useful. A lot of systems get that wrong.
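"Log aggressively, filter aggressively" can be as simple as a handler-level filter whose threshold is adjustable at runtime, so opening the floodgates during an incident doesn't require a deploy (a Python sketch; the names are invented):

```python
import logging

class NoiseGate(logging.Filter):
    """Drops records below a runtime-adjustable level."""
    def __init__(self, level=logging.WARNING):
        super().__init__()
        self.level = level

    def filter(self, record):
        return record.levelno >= self.level

gate = NoiseGate()
handler = logging.StreamHandler()
handler.addFilter(gate)

log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.DEBUG)  # emit everything; the gate decides what survives

log.debug("cache miss key=abc")   # filtered out in normal operation
log.warning("payment retry #3")   # always gets through
gate.level = logging.DEBUG        # incident mode: open the gate, no deploy
log.debug("now visible")
```

The same shape works at the log-shipping layer (drop at the collector, keep raw logs briefly), which is usually where the cost savings actually come from.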

bravesoul2 · 36m ago
Nice. I'm going to read up more about filtering.
dondraper36 · 1h ago
Yeah, absolutely. But the author's idea of logging all major business logic decisions (that users might question later) sounds reasonable.
bravesoul2 · 1h ago
Yes. I like the idea of assertions too. Log when an assertion fails. Then get notified to investigate.
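
A minimal version of "log when an assertion fails" (a Python sketch; `soft_assert` is an invented helper): record the violation and keep serving, then alert on the error log.

```python
import logging

log = logging.getLogger("app")

def soft_assert(condition, message):
    """Production assertion: log instead of crashing, so the request
    still completes and someone gets notified to investigate."""
    if not condition:
        log.error("assertion failed: %s", message)
    return bool(condition)

# e.g. inside an order handler:
total = -5
soft_assert(total >= 0, f"negative order total: {total}")
```
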
magnio · 56m ago
I think it's a very good article. Even if you disagree with some of the individual points, the advice given is very concrete, pragmatic, and IMO tunable to the specifics of each project.

On state, in my current project, it is not statefulness that causes trouble, but the need to synchronize two stateful systems. Every time there's bidirectional information flow, it's gonna be a headache. The solution is of course to maintain a single source of truth, but with UI applications this is sometimes quite tricky.

bravesoul2 · 1h ago
He doesn't seem to mention Conway's law or team topologies, which are an important part of system design too.
dondraper36 · 1h ago
Well, as sad as it is, such advice is often applicable to new projects when you still have runway for your own decisions.

For mostly political reasons, if you are onboarded to a team with a billion microservices and a lot of fanciness, it's unlikely that you will ever get approval or time to introduce simplicity. Or maybe I just got corrupted myself by the reality where I have to work now.

bravesoul2 · 1h ago
There is definitely a wood for the trees issue at bigger companies. I doubt there is an architect who understands the full system to see how to simplify it. Hard to even know what "simpler" looks like.
jillesvangurp · 19m ago
You should adapt your team to the architecture, not the other way around.

My former Ph.D. supervisor who moonlights as a consultant on this topic uses a nice acronym to capture this: BAPO. Business, Architecture, Process, and Organization. The idea is to end up with optimal business, an optimal architecture & design for that business, the minimum of manual processes that are necessitated by that architecture, and an organization that is efficiently executing those processes. So, you should design and engineer in that order.

Most companies do this in reverse and then end up limiting their business with an architecture that matches whatever processes their org chart necessitated years ago, in a way that doesn't make any logical sense whatsoever except in the historical context of the org chart. If you come in as a consultant to fix such a situation, it helps to understand that whatever you find is probably wrong for this reason. I've been in situations where I came in to fix a technical issue and immediately saw that the problem only existed because the org chart was bullshit. That can be a bit awkward, but lucrative if you deal with it correctly. It helps to ask the right questions before you get started.

Turning that around means you start from the business end (where's the money coming from?, what value can we create?, etc.), finding a solution that delivers that and then figure out processes and organizational needs. Many companies start out fairly optimal and then stuff around them changes and they forget to adapt to that.

Having micro services because you have a certain team structure is a classic mistake here. You just codified your organizational inefficiency. Before you even delivered any business value. And now your organizational latency has network latency to match that. Probably for no good reason other than that team A can't be trusted to work with team B. And even if it's optimal now, is it going to stay optimal?

If you are going to break stuff into (micro) services, do so for valid business/technical reasons. E.g. processing close to your data is cheaper, caching for efficiency means stuff is faster and cheaper, physically locating chunks of your system close to the customer means less latency, etc. But introducing network latency just because team A can't work with team B, is fundamentally stupid. Why do you even have those teams? What are those people doing? Why?