with movieIds as (select id from Movie where title = $1),
actorIds as (select Actor.id from Actor join ActorMovie on [...]
    where ActorMovie.Movie in (select id from movieIds)),
alsoActedIn as (select id from ActorMovie where actor in (select id from actorIds)),
movieResults as (select * from Movie where id in (select id from movieIds)),
actorResults as (select * from Actor where id in (select id from actorIds)),
alsoActedInResults as (select * from Movie join ActorMovie on [...]
    where ActorMovie.id in (select id from alsoActedIn))
select * from movieResults
full outer join actorResults on false
full outer join alsoActedInResults on false;
Not every database supports "full outer join on false," and sometimes you have to add "as not materialized" to the "with" subqueries in order for it to be performant, but it works in Postgres, and you end up with a results table that looks something like this:
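(An illustrative sketch only - hypothetical rows, with each branch of the on-false join filling its own columns and leaving the rest NULL:)

     id | title     | id | name      | id | title
    ----+-----------+----+-----------+----+----------
      1 | Toy Story |    |           |    |
        |           | 17 | Tom Hanks |    |
        |           |    |           | 42 | Big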
This pattern has saved me from some truly awful query logic.
gerad · 1h ago
couldn't you do a union all instead of outer join on false?
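(A sketch of what that would look like, with column names assumed - every arm has to be padded out to one shared column list, which is exactly the bookkeeping the on-false join does for you:)

    select 'movie' as kind, id, title, null as name from movieResults
    union all
    select 'actor', id, null, name from actorResults
    union all
    select 'alsoActedIn', id, title, null from alsoActedInResults;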
greggyb · 11h ago
Coming from a data engineering and business analytics background, the terms "structured" and "unstructured" seem to be swapped in meaning here.
I'd expect to talk about anything in a DB as "structured data". Flexible serialization formats such as JSON or XML are "semi-structured". And something from e.g., an office document is "unstructured".
This is not a complaint or criticism. I understand the article just fine. It's just interesting how different perspectives can affect what words mean (:
sinfulprogeny · 10h ago
Kleppmann [1] calls it schema-on-read (json, xml) and schema-on-write (typed columns in an RDB). I like it over structured/unstructured; it's a bit more specific.
[1] https://martin.kleppmann.com/2017/03/27/designing-data-inten...
Yes, I agree with that preference. I don't love the verbiage of "structured" / "unstructured" in either usage (the article's or that which I shared).
stevage · 9h ago
I think the SQL sense is more that "structured" means "it contains internal structure" (like a tree), whereas a table without JSON is free from additional structure apart from its own table structure.
thaumasiotes · 9h ago
> Coming from a data engineering and business analytics background, the terms "structured" and "unstructured" seem to be swapped in meaning here.
Mark Rosewater likes to write about his personal concept of "linear" Magic: the Gathering decks, which are decks in which the cards tend to pick up synergy bonuses from each other, so that having two of them together is considerably more powerful than you'd expect from the strength of those two cards individually.
This always bothers me because it is the opposite of the more normal use of "linear" relationships, in which everything contributes exactly as much to an aggregate as it's worth individually.
stared · 10h ago
I used to work a bit with MongoDB (over 10 years ago) and it did wonders in making me fall back in love with relational databases. I finally saw the light of Codd!
SQL, as a language, is clunky, true. It can be patched here and there, either by PipeSQL or by various ORMs. I agree that it would be wonderful to have standardized tooling for generating JSON like in the post.
Yet with relational databases you can separate the concerns of what your data is and what you want to display. If you use a JSON-like way to store data, it can do the job until you want to change the data or the queries.
andyferris · 8h ago
100%.
The shame, to me, is that SQL is _unnecessarily_ clunky with producing query results with nested/hierarchical data. The relational model allows for any given value (field or cell value) to be itself a relation, but SQL doesn't make it easy to express such a query or return such a value from the database (as Jamie says - often the API server has to "perform the join again" to put it in nested form, due to this limitation).
zozbot234 · 21m ago
> SQL is _unnecessarily_ clunky with producing query results with nested/hierarchical data.
Property Graph Query (PGQ) is now a part of SQL and is expected to help with expressing these complex queries.
strbean · 56m ago
SQL is unnecessarily clunky in just about every respect. It's mind boggling that the syntax of a language designed in the 1970s has so many ardent defenders.
bazoom42 · 5h ago
> The relational model allows for any given value (field or cell value) to be itself a relation
First normal form explicitly forbids nested relations though. Relational algebra does not support nested relations for this reason.
But perhaps nesting relations might make sense as the final step, just like sorting, which is not supported by the pure relational model either.
marcosdumay · 40m ago
If your relational algebra comes with normative opinions about data normalization, there's something really strange with the entire way you think about mathematics.
Usually, relational algebra doesn't have many restrictions about the type of the atomic values it deals with, which makes sequences and relations perfectly valid candidates.
But yeah, there are many reasons for you to normalize your data at rest.
arnsholt · 10h ago
My stint with MongoDB was brief, but I too came away with a deeper appreciation of SQL databases. I feel it's a bit like a good type system: yes, there absolutely is an upfront cost, but over time it really does save you from stupid runtime problems. It's especially important when the bugs are in the data storage layer, because if a bug causes bad data to be written to your database, not only do you need to find and fix the bug, you also have to figure out how to deal with (possibly lots of) bad data.
Also, modern SQL is an incredibly powerful language. Good SQL can save you from lots of dumb data munging code if you know how to wield your database properly. Especially for analytical queries, but also more typical application code IMO.
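(A small example of the kind of munging it absorbs - a per-customer running total in one pass, schema assumed:)

    select customer_id,
           order_date,
           sum(total) over (partition by customer_id
                            order by order_date) as running_total
    from orders;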
anonymars · 2h ago
I have spent many years beating the drum that a few minutes spent on constraints today save many, many, many hours of painful data cleanup later
And in fact, good constraints also can improve query performance. If the optimizer knows this column is unique, or that column is guaranteed to have a corresponding value in the joined table, it can do all sorts of tricks...
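(A minimal sketch, names assumed - the unique constraint and the foreign key below are exactly the guarantees that let a planner skip deduplication or eliminate a join entirely:)

    create table users (
        id    bigint primary key,
        email text not null unique
    );
    create table orders (
        id      bigint primary key,
        user_id bigint not null references users (id)
    );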
da_chicken · 9h ago
Yes, nothing will make you appreciate the relational model more than using a database where the designer broke first normal form. Or where the developer thought the data in the database was theirs and it was only ever going to be used by just their application.
Something else I find confusing is that every developer seems to want every database query for a single object to return a single row with a thousand columns, and anything multi row or with multiple results is way too complicated to handle. This goes double for report writing software.
I really wonder what we're doing with database providers (drivers) that makes people want to develop ORMs when ORMs consistently feel like 10 ton gorillas. If the database world is so disparaging of the "row by excruciating row" processing, why do the drivers exclusively present data that way and no other?
mrlongroots · 7h ago
Agree: flat/relational structure is highly efficient to store, query planning/optimization/scaling become a lot easier, the language is clunky but relational algebra is beautiful, but...
I think pushing multiple joins to what is still a relational database and getting a complex output isn't the worst idea in the world, as a higher-level layer on top of a regular database.
On the other hand, "it needs four queries/RTTs" is not the worst thing in the world. It needn't be the goal of a system to achieve theoretical minimum performance for everything.
Let those who truly have the problem in prod push the first patch.
andyferris · 9h ago
Yes - I have to agree.
Codd was right in that if you want transactional semantics that are both quick and flexible, you'll need to _store_ your data in normalized relations. The system of record is unwieldy otherwise.
The article is right that this idea was taken too far - queries do not need to be restricted to flat relations. In fact the application, for any given view, loves hierarchical organization. It's my opinion that application views have more in common with analytics (OLAP) except perhaps latency requirements - they need internally consistent snapshots (and ideally the corresponding trx id) but it's the "command" in CQRS that demands the normalized OLTP database (and so long as the view can pass along the trx id as a kind of "lease" version for any causally connected user command, as in git push --force-with-lease, the two together work quite well).
The issue, of course, is that SQL eschews hierarchical data even in ephemeral queries. It's really unfortunate that we generate jsonb aggregates to do this instead of first-class nested relations a la Dee [1] / "third manifesto" [2]. Jamie Brandon has clearly been thinking about this a long time and I generally find myself nodding along with the conclusions, but IMO the issue is that SQL poorly expresses nested relations and this has been the root cause of object-relational impedance since (AFAICT) before either of us were born.
[1] https://github.com/ggaughan/dee [2] https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf
> The article is right that this idea was taken too far
The biggest mistake was thinking we could simply slap a network on top of SQL and call it a day. SQL was originally intended to run locally. You don't need fancy structures so much when the engine is beside you, where latency is low, as you can fire off hundreds of queries without thinking about it, which is how SQL was intended to be used. It is not like, when executed on the same machine, the database engine is going to be able to turn the 'flat' data into 'rich' structures any faster than you can, so there is no real benefit to it being a part of SQL itself.
But do that same thing over the network and you're quickly in for a world of hurt.
reaanb2 · 6h ago
In my view, the O/R impedance mismatch derives from a number of shortcomings. Many developers view entities as containers of their attributes and involved only in binary relationships, rather than the subjects of n-ary facts. They map directly from a conceptual model to a physical model, bypassing logical modeling. They view OOP as a data modeling system, and reinvent network data model databases and navigational code on top of SQL.
mcphage · 5h ago
In that case, the mismatch is between "What developers need" and "What SQL provides".
kragen · 9h ago
I agree about first-class nested relations, but I don't agree about transactions.
Codd was writing 10 years before the idea of transactional semantics was formulated, and transactions are in fact to a great extent a real alternative to normalization. Codd was working to make inconsistent states unrepresentable in the database, but transactions make it a viable alternative to merely avoid committing inconsistent states. And I'm not sure what you mean by "quick", but anything you could do 35 years ago in 10 milliseconds is something you can do today in 100 microseconds.
andyferris · 8h ago
It's not about just _transactions_. What you wrote is 100% correct.
It's specifically about _fast_ transactions in the OLTP context. When talking about the 1970s (not 1990s) and tape drives, rewriting a whole nested dataset to apply what we'd call a "small patch" nowadays wasn't a 10 millisecond job - it could feasibly take 10s of seconds or minutes or hours. That a small patch to the dataset can happen almost instantly - propagated to its containing relation, and a handful of subordinate index relations - was the real advance in OLTP DBs. (Of course this never has and never will help with "large patches" where the dataset is mostly rewritten, and this logic doesn't apply to the field of analytics).
Perhaps Codd "lucked out" here or perhaps he didn't have the modern words to describe his goal, but nonetheless I think this is why we still use flat relations as our systems of record. Analytical/OLAP systems do vary a lot more!
kragen · 8h ago
Hmm, but I think people doing OLTP in the 01970s were largely using things like IMS, which used ISAM, on disk, to be able to do small updates to large nested datasets very quickly? And for 20+ years one of the major criticisms of relational databases was that they were too slow? And that even today the remaining bastions of IMS cite performance as their main reason for not switching to RDBMSes?
I think that if you're processing your transactions on tape drives, your TP isn't OL; it's offline transaction processing.
I think Codd's major goal was decoupling program structure from on-disk database structure, not improving performance. There's a lot of the history I don't know, though.
gethly · 9h ago
I am using event sourcing with cqrs and that means i also utilise projections, which the db itself can replicate less efficiently as a "view" or virtual table.
So one can technically create a projection/view that is tailor-made for a query that needs to display some data. Of course it is not often possible to retrieve all the data with a single select command.
So joins and multiple queries are simply inherent to the complexity of the data we store nowadays.
Anyway, years ago I moved to a db model where every entity is stored as a blob in a dedicated column, and every additional column beside id is indexed. So there is no wasted space and no pile of columns that only hold data but are never used for filtering. I can run data-efficient queries that yield a list of ids of blobs to load, or the blobs themselves, and then I extract any necessary data out of those blobs (entities) on the application level. So the database is purely a blob store + a few fast indices.
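(A sketch of that layout as described - all names hypothetical: one opaque blob, plus a handful of indexed scalar columns used only for filtering:)

    create table entity (
        id      uuid primary key,
        kind    text  not null,   -- indexed filter column
        owner   uuid  not null,   -- indexed filter column
        payload bytea not null    -- the serialized (optionally encrypted) entity
    );
    create index on entity (kind);
    create index on entity (owner);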
kragen · 9h ago
What's your application?
gethly · 7h ago
I use it on gethly.com, and previously I used the same approach on an internal application for a domain registrar that handles EPP. So far no issues. Also, using those blobs allows me to easily add an encryption layer at the application level.
kragen · 6h ago
I see: "Gethly is a paywalled hosting and sales platform for digital content creators.
Sell online courses, license downloadable content, receive donations or build communities accessible via paid memberships." Are all the online courses and downloadable content and analytics and payments in the CQRS system, or are you using it for only some subset?
gethly · 5h ago
The entire system is event-sourced. I wrote a few articles about the technical aspects of the platform on the blog https://gethly.com/blog
At this time, there are 12 services that make up the entire application. Event-sourcing is what allows infinite scaling and CQRS.
ES is the pinnacle of technology in the web sector, but it comes with a lot of overhead and time to market is significantly slowed down. So it is not something every project can or should consider implementing.
kragen · 1h ago
Thank you very much!
kragen · 9h ago
An interesting thing about "A Relational Model of Data for Large Shared Data Banks" (the stone tablets handed down from Codd https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf) is that it starts out by considering N-ary relations whose values may themselves be relations, and only then switches to considering "normalized" relations whose values are not relations:
> A relation [table] whose domains [column types] are all simple [not relations] can be represented in storage by a two-dimensional column-homogeneous array of the kind discussed above. Some more complicated data structure is necessary for a relation with one or more nonsimple domains. For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating.⁴ There is, in fact, a very simple elimination procedure, which we shall call normalization.
But non-normalized relations support the kind of nested structure the eminent Dr. Brandon wants, without resorting to JSON or abandoning the strong uniform typing we have with SQL. Darwen & Date's The Third Manifesto https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf includes group and ungroup operations which translate back and forth between normalized and non-normalized relations.
I've been playing around with some of these ideas using some rather horrific atrocities perpetrated against Lua operator overloading in http://canonical.org/~kragen/sw/dev3/cripes.lua. I haven't added joins or ungrouping to it yet, but I think it might be a prototype of something promising for this kind of use case.
mrkeen · 9h ago
The author didn't see fit to mention that you just flip the arrows.
Classes/Structs know about their children.
Relations know about their parents. If you want crazier m*n relationships you use an association table.
Did the author just not know? Or he didn't see it worthy of dismissal?
> Doing this transformation by hand is tedious and error-prone. We call this tedium "the object-relational mismatch" but it isn't really about objects or relations.
It really is.
> The fundamental problem is that fitting complex relationships to human vision usually requires constructing some visual hierarchy, but different tasks require different hierarchies.
Yes. Different tasks require different hierarchies. One particular way of doing things should not be baked into your 1970s relational model.
sgarland · 7h ago
The only reason this post has any truth is that the IMDB dataset is horrendously unnormalized. It also points out at the end that you can in fact use a combination of aggregations to return the desired JSON object.
Even then, if you really wanted to, you can absolutely make a tree-like structure in SQL in a variety of ways: adjacency lists, nested sets, closure tables, etc.
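(For instance, a minimal adjacency-list sketch - names assumed - with a recursive CTE to read a subtree back out:)

    create table category (
        id     int primary key,
        parent int references category (id),
        name   text not null
    );

    with recursive subtree as (
        select id, parent, name from category where id = 1
        union all
        select c.id, c.parent, c.name
        from category c
        join subtree s on c.parent = s.id
    )
    select * from subtree;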
mcphage · 5h ago
> you can absolutely make a tree-like structure in SQL in a variety of ways: adjacency lists, nested sets, closure tables, etc.
That's kinda the problem—rather than there being one way to make a tree-like structure in SQL that works correctly, there are a lot of ways to do it, and they all have different tradeoffs, and they're all a bit obnoxious.
sgarland · 1h ago
There are multiple ways to do just about anything, in any programming language.
ThinkBeat · 9h ago
Would this not become a bit simpler if you populate the sql schema like this?
This was enlightening. I've often wondered why ORMs exist, and it turns out a bunch of devs don't understand tabular data. I was going to say they don't understand SQL, but apparently it goes deeper than that.
Like, left joins have been around since SQL-92, so the query not returning a result when there is no director is a skill issue.
Nowadays, you can run a JSON or XML query and have the DB return your results in a hierarchical format.
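(E.g. a minimal Postgres sketch over the article's movie schema - exact table and column names assumed - nesting each movie's actors as a JSON array:)

    select m.title,
           jsonb_agg(a.name order by a.name) as actors
    from Movie m
    join ActorMovie am on am.movie = m.id
    join Actor a on a.id = am.actor
    group by m.id, m.title;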
> - In a small team (averaging ~3 people, I think?) in a little under a year, we built a SQL query planner that passed the >6 million tests in the sqlite logic test suite.
> - I figured out how to decorrelate arbitrary SQL subqueries (I wasn't the first person to figure this out, but I got there independently).
> - I demonstrated a mode of consistency failure in various popular streaming systems that I believe was not widely understood at the time.
I think it is likely that the reason you disagree with him is not that he "don't understand tabular data" and has "a skill issue". Unless you're secretly Michael Stonebraker?
You write:
> Nowadays, you can run a JSON or XML query and have the DB return your results in a hierarchical format.
You may not have noticed this, but the post you are commenting on explains how to run a JSON query and have the DB return your results in a hierarchical format.
mcdonje · 8h ago
My take was charitable because if he knows better, then he's making a bad faith argument, and I didn't want to accuse him of that.
I didn't read beyond the article. I only responded to what was in the post.
He didn't use capabilities SQL has had for decades, made a bad query, and used that to build his argument about how bad SQL is.
The ORM tables could've been written in one query that returns a tabular format of hierarchical data, which the front end could use.
What does that look like? It looks like the kind of underlying tabular format JSON & XML queries require. He didn't do that with his non-JSON examples.
Yes, the framework would need to turn it into an object. Yes, there is data duplication in some fields. While not ideal, it's still better than the ORM example because you're not splitting it up into multiple queries.
His JSON example undercuts his main argument. The DB is fully capable of returning the representation he wants. It's also more complex than necessary with all the JSONB_AGGs in the subqueries. Then he says you can see the duct tape.
So yeah, it reads like a front end person who never really learned modern SQL. Or maybe it's someone who knows better who is making a bad faith argument. Either way, it's just a bad take.
kragen · 7h ago
I don't think his final query is more complex than necessary? Maybe there's a better approach that I'm not seeing.
mcdonje · 7h ago
Well for one thing, he's using a bunch of correlated subqueries, which is a bad idea. He even links to an article about why it's a bad idea and then does it anyway. Generally easily avoidable.
For another, postgres has more JSON functions than just jsonb_agg.
https://www.postgresql.org/docs/current/functions-json.html
I don't have time to write something out right now, but there are definitely options.
kragen · 7h ago
I'm not sure there's a better option in Postgres, and I'm not convinced you understand his main argument. How would you restate it?
mcdonje · 6h ago
You're not sure there's a better option than JSONifying a bunch of correlated subqueries individually?
kragen · 6h ago
That's right: I'm not sure there's a better option in Postgres than JSONifying a bunch of correlated subqueries individually. How would you do it?
It's kind of a red flag, though, that you didn't answer my question about how you would restate his main argument. It suggests to me that you're just here to flame instead of to have a productive discussion.
mcdonje · 5h ago
I already said I don't have time to rewrite the query right now, but I'd definitely get rid of the correlated subqueries, which should be obvious from my comments. I may or may not use an intermediate representation, like a CTE, and I'd probably make use of JSONB_OBJECT_AGG.
I kind of thought it was a red flag that you were accusing me of not understanding the post, which is about how SQL is bad at hierarchical data, while defending the bad SQL he used to defend his argument. But I didn't make a big deal out of it.
strbean · 41m ago
I think that poster is asking you to answer this question:
> I'm not sure there's a better option in Postgres, and I'm not convinced you understand his main argument. How would you restate it?
Rather than asking you to demonstrate the better way of writing the query.
mcdonje · 12m ago
I know. I answered that in my second paragraph. They also asked how I'd approach writing the query after I already mentioned the issues and sent them the postgres doc page.
kragen · 2m ago
You would restate his main argument as "I kind of thought it was a red flag that you were accusing me of not understanding the post, which is about how SQL is bad at hierarchical data, while defending the bad SQL he used to defend his argument. But I didn't make a big deal out of it."? That doesn't make any sense.
I think you don't understand what the post is about, you don't understand what the query is trying to achieve, you don't know a better way to do it, and you're just engaging in ego defense instead of contributing anything.
Mikhail_Edoshin · 11h ago
There was a discussion about a relevant paper here at HN:
"What if an SQL statement returned a database?"
https://news.ycombinator.com/item?id=38606146
I am becoming more and more comfortable with storing everything in JSON.
With its JSON arrow operators and indexed expressions, SQLite makes a really nice JSON document store.
Assume you have a "cars" table with nothing but a "cars" column that is a JSON blob. Then selecting the models of all cars is just:
SELECT cars->>'model' FROM cars
I wish MariaDB would follow suit and also support arrow operators. You can do it in MariaDB, but it becomes more cumbersome:
SELECT JSON_UNQUOTE(JSON_EXTRACT(cars, '$.model')) FROM cars
tgv · 11h ago
If that's all, I still recommend classical columns. Now if you have array values and/or values with varying structures which you don't need to query (frequently), JSON makes sense.
mr_toad · 11h ago
> SELECT cars->>'model' FROM cars
Can ‘model’ be a variable, or does it have to be a constant literal?
When something in SQL becomes cumbersome or difficult to work with, people often tend to create entirely new solutions rather than improving the existing ones. Few developers seem interested in refining SQL itself.
bmn__ · 7h ago
The hurdle to entry is too damn high. If you want to have your day ruined, let an ISO WG participant explain to you the social and monetary capital needed just to be able to make the most timid of suggestions for refining the language.
phiresky · 9h ago
I think relational databases are great, but this is my biggest problem with SQL, even before the ridiculous syntax.
A query like
select users.*, orders.* from users left join orders on orders.user_id = users.id
Should always have returned a structure like:
type SingleReturnRow = { users: {id: ..., }[], orders: {user_id: ..., id: ..., }[]}
type Return = SingleReturnRow[]
Mangling the columns together and _removing_ groupings that naturally appear is just so unnecessary.
I don't think a larger change in the query language would even be needed.
Even better of course would be a return value like
type SingleReturnRow = { user: User, orders: Order[] }
But I see how that would require a fundamental change in the language.
Of course in PG now you can use
select users.*, json_agg(orders.*) as orders from users left join orders on orders.user_id = users.id group by users.id
but using JSON as intermediate steps just feels unnatural.
johannes1234321 · 9h ago
It follows the relational algebra model. Relations (aka Tables) go in, relations (aka Tables) come out. This makes some things really nice.
However, I proposed a hierarchical result for such cases for our database a long time ago, but couldn't convince enough people. json_agg came later, and with it all the machinery is there; it would "just" require exposing this to the protocol and adapting all clients to understand that data format ...
andyferris · 9h ago
There's nothing in the relational model that suggests a field (cell) can't take the value of a relation. Only SQL makes that difficult.
johannes1234321 · 6h ago
There is nothing forbidding it, but then you can't process it further with the same algebra, that value then is a single opaque value. (Which for many uses is fine as that should be one of the final steps of processing)
This is a problem that ActiveRecord-style ORMs have failed to solve. One possible solution is writing queries like in the article that transform normalized data from the DB into a JSON structure. Another possibility is to design an ORM that is actually capable of taking a set of relations and dynamically building queries with joins, then transforming the returned rows into tree-like structures.
This is a very interesting area for exploring a new kind of ORM. I find the whole "CRUD monkey" shtick preached by DHH et al quite lacking for expressing the kind of hierarchical data structures developers need.
mamcx · 6h ago
The things `SQL` fails to do that would enable this sort of thing are:
* Store relation-on-cell: the whole trick of so-called `nosql` is that it can do nested data. THAT IS ALL
* Then, `join` is already inline
* Then, instead of the very poorly named `GROUP BY` it could have a REAL `GROUP BY`!
That is all that is needed at the core
hbrundage · 10h ago
I agree and would take it one step further — the structure of the joins is something that should most often come from the schema, not the query. In the same way that the attributes of an entity should be modelled out ahead of time, the relationships between the entities should be as well, and that yields more productive querying and better performance.
We built a GraphQL / SQL hybrid expression language that does just this here: https://docs.gadget.dev/guides/data-access/gelly#gelly , curious what yall think.
I disagree. From a fact-oriented modeling perspective, relationships among entities are already well-modelled and represented in tables. There's a reason the relational model calls tables relations - they relate things. The mistake here is viewing tables/rows as representing entities, FK constraints as representing relationships, and viewing entities as containers of attributes rather than the subjects of facts.
Joining tables is composing complex facts from simple ones, and is the opposite of normalization which is decomposing complex facts into simpler ones. The ability to join tables on arbitrary conditions is fundamental to the ability to ask a DBMS complex questions and have it respond with all the facts that match that question.
ivanb · 9h ago
One limitation of JSON is its limited set of types. For example, for decimal numbers one has to resort to a stringly-typed representation, because DB connection libraries assume that JSON numbers are floating point. Note that JSON numbers are just sequences of digits, nothing more; there is no attached precision semantics.
Another example is UUIDs. Instead of transferring 16 bytes, the libraries deal with a wasteful string representation. I'm sure you can bring up other examples.
Nonetheless, for the majority of data, JSON as a DB output format is alright.
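(The decimal case in one line of Postgres - numeric survives inside jsonb, but most client JSON parsers will read it back as a float, so the usual workaround is casting to text:)

    select jsonb_build_object(
        'price_lossy', 19.99::numeric,        -- client likely parses as float64
        'price_safe',  (19.99::numeric)::text -- stringly typed, but exact
    );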
Spixel_ · 9h ago
I don't get the transaction bit. At least with postgres, a transaction doesn't guarantee that all statements in it see the data at the same point in time (actually, it's not even guaranteed for subqueries).
Also, often the transactional database server is more difficult to scale than application servers, so from a technical standpoint it makes sense to do this glue work in app code.
andyferris · 9h ago
> a transaction doesn't guarantee that all statements in it see the data at the same point in time
This depends on the transaction isolation level. If you use snapshot or serializable this should be the case (but you may have aborted transactions due to optimistic concurrency).
https://www.postgresql.org/docs/current/transaction-iso.html
Spixel_ · 6h ago
You are right, but note that the default isolation level is "Read committed" in postgres.
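(A minimal sketch of opting in per transaction, table names assumed:)

    begin isolation level repeatable read;
    -- both statements now see the same snapshot
    select count(*) from orders;
    select sum(quantity) from order_lines;
    commit;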
TheTaytay · 9h ago
Doesn’t it guarantee consistency from the time the transaction started (assuming read committed isolation)? It guarantees you won’t see something “later” than when your transaction began.
I’m likely misunderstanding what you mean by time.
Spixel_ · 6h ago
Read committed (which is the default), doesn't guarantee that.
See "Nonrepeatable Read" and "Phantom Read" which are both possible in your documentation page.
Jweb_Guru · 8h ago
SSI scales fine for most workloads as long as you correctly mark your read-only transactions.
zigzag312 · 10h ago
I think the data definition schema needs a better type system, one which not only defines the database schema but also contains all necessary data and allows annotations, so that quality language-specific code can be generated from it. A single source of truth for data structure, used both by the database and by code.
kevindamm · 10h ago
I see some value in that, but in my experience it's actually the queries that inform much of what the table schema and indexes should look like. Especially when joins and/or ordering are involved, it's the details about what the application wants to extract from the DB that are important, and I'd rather have those details at the code's call site, not at the schema definition, because if/when they change or are removed it will be clearer from the context.
zigzag312 · 9h ago
What I'm proposing doesn't change that. It just adds the information necessary to keep strongly typed data in sync between the database and DTOs, to help with serialization and mapping. Compile-time type safety.
When querying the db for its schema, you don't get enough information to generate very good code in all cases.
Language-first way (define schema in language and generate database schema/migrations) ties you to just one language. And usually these tools don't support using all database features.
redwood · 10h ago
This is the core reason why MongoDB is popular. And while the HN crowd fell into a trough of disillusionment with it 10 years ago, since it wasn't ready then, a bunch of folks have found success with it more recently. At my company we had legacy implementations of it and modern ones, and the juxtaposition was significant.
roenxi · 11h ago
The real counterargument here is that the DB systems that stored nested data mostly lose in competition with relational systems over time, as it turns out they don't satisfy the large number of people who use data for things other than UIs - they aren't suitable for long-term storage and don't separate concerns in a way that allows generalists like database engineers to optimise how data gets to and from a storage location into memory.
And, in practice a lot of these frontenders end up storing a JSON blob with either no nesting or just one level of nesting that looks a lot like a database schema built by someone being stubborn about not normalising data. In some sense that is fine, databases seem to be learning to just treat simple JSON blobs as normalised data but it is hard to make that as efficient as a guaranteed schema. Plus larger companies often end up having to hire a dedicated team of engineers to properly normalise data so other people can use it.
tucnak · 12h ago
> All that's left to do now is... the same joins, but inside the backend web server. Because we have to re-assemble these flat outputs into the structure of the page.
This is NOT the case with modern SQL, as it supports JSON. The typical Postgres use-case would be to produce denormalised views for the specific relationships and/or aggregates in JSON form. LATERAL, the correlated sub-query, is really convenient in avoiding double-counting. The rest of the article deals with object-mapping, which is really a poor man's VIEW. Time and time again, software engineers reinvent the data-access wheel out of ignorance of database capabilities.
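(A hedged sketch of the LATERAL pattern - schema assumed - fetching each user's three newest orders without double-counting users:)

    select u.name, o.id, o.total
    from users u
    cross join lateral (
        select id, total
        from orders
        where orders.user_id = u.id
        order by created_at desc
        limit 3
    ) o;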
andyferris · 9h ago
JSON is very helpful here, and it _happens_ [1] that our HTTP APIs tend to respond with JSON, but I feel SQL's data model is much better suited. The difference is that JSON is hierarchical but SQL relations are flat. However if we let a relation be contained as a value (in a field/cell) then we should be able to produce our hierarchical query result in a natural way (and possibly map that through a `toJSON` method on the API server, if the client wants application/json... otherwise the server might want to do grpc or whatever and getting JSON from the DB is a roundabout pain in the bum).
[1] Actually causality is backwards here - postgres supports JSON only because the Restful API servers that frequently interact with it need to provide JSON to _their_ clients... and so it was a highly demanded feature.
taffer · 12h ago
This is also the conclusion of the article. The author then shows an example of how to get hierarchical json out of your relational database.
bob1029 · 8h ago
> Time and time again, software engineers reinvent the data-access wheel out of ignorance to database capabilities.
Much of this discourse around SQL "not having structure" seems to be about arrogance rather than ignorance. It would take 10 seconds with ChatGPT to resolve this deficit, but for some reason we insist on writing entire blogs and burning hundreds of aggregate hours commenting about a make-believe world wherein views and CTEs don't exist.
MaxMonteil · 12h ago
This sounds very interesting, I've been using SQL more lately and am constantly impressed by what it can just do.
My thinking is that everything I can get done in the DB avoids heavier and maybe slower application code.
Do you have some resources or material I could check to learn more?
Example code: https://github.com/sivers/store
This talk is more Oracle specific but with some minor changes you can apply the same ideas to postgres: https://www.youtube.com/watch?v=GZUgey3hwyI
Personally I am using an architecture similar to https://sive.rs/pg in my personal projects and I am very happy with it. It is important that you put data and procedures/views in different schemas so that you can use migrations for your data but automatically delete and recreate the procedures/views during deployment. Also use pgtap or something similar for testing.
reaanb2 · 3h ago
That's a cool approach that could work well if you don't need realtime data validation such as in a UI. I would love to find a solution that allows the same validation rules to be used in the DBMS as well as in the backend and frontend code.
MaxMonteil · 10h ago
Oh nice! I had read this Sivers post a long time ago, time for a refresher.
Appreciate the info, thanks!
bbkane · 8h ago
I like this point of view but putting logic in the database also has downsides - the tooling in particular is bad. No debugger, hard to write tests, can't canary changes, bad or non-existent autocompletion, lack of linters, etc.
MaxMonteil · 5h ago
The tooling keeps getting better so I'm bullish on that side.
For canaries there's growing support for branching the database which can help.
But in the end, this, like all things, requires balance. Putting everything in there can lead to pain down the road; for example, I wouldn't follow the http part.
My use case is with an app's local sqlite, and I have a lot of code transforming the returned rows into JSON. It works, but feels slower and too divorced from the data.
fredguth · 12h ago
Today I Learned the power of LATERALs... thanks. :-)
silon42 · 10h ago
I've seen 'decorrelation' mentioned... 95% of the time, what I am fighting the query optimizer for is to get the correlation and a loop over a few items working instead of a table scan.
maweki · 11h ago
Isn't the author just proposing GraphQL?
andyferris · 8h ago
Not at all.
GraphQL has the API servers (database clients) marshalling the data into hierarchies (nested data) and managing consistent results (e.g. streaming updates).
The article is arguing that the database should be more capable of understanding queries that request nested data (and separately Jamie has spoken in favour of databases that encourage end clients getting up-to-date/internally-consistent results).
My two cents is that it's a crutch having SQL rely on JSON instead of its own first-class nested data (which would mean any given value (field, aka cell) can itself be a relation - at least postgres has array values, but they are challenging to index/use keys with programmatically).
h4ch1 · 10h ago
I mean for highly connected data shouldn't the solution be to ideally use a graph database from the get go instead of generating JSON aggregates in the database?
Don't get me wrong, JSONB is great but I see a lot of people simply slap on a JSON column to anything that requires more complexity without trying to first solve it through RDBMS relations.
zozbot234 · 11m ago
PostgreSQL makes a perfectly fine graph database, at least for general workloads. You're not going to get vastly improved performance from a specialized solution unless you're doing very specific things like math-heavy network analysis that PostgreSQL is still a bad fit for.
nrjames · 3h ago
I loaded this same IMDB dataset into a Kuzu DB the other day, just for fun. It was an interesting project but I would agree that this dataset in particular lends itself to a graph database. Kuzu started to choke with > 10M nodes & 10M edges but I was doing a naive insert & probably could fit the entire dataset in without trouble using their direct CSV import.
jbverschoor · 10h ago
Time to start taking that those CS classes before blogging
kjfarm · 10h ago
This response comes across as unnecessarily toxic and quip-y, which is a shame because there's real substance here. The author clearly understands database design, referencing concepts like normal forms and Boyce-Codd Normal Form (BCNF). Even if I don't agree with the conclusion, I have to respect anyone who takes the time to lay out their argument. Challenging ideas are what help our community grow.
kerbobotat · 10h ago
Time to start proofreading your comments before posting.
eatonphil · 9h ago
Guessing from the relevant masters thesis linked on his site and his time at materialize and tigerbeetle, Jamie's been working on databases for at least like 15 years?
At least personally when Jamie says something I listen. :)