A PostgreSQL planner semi-join gotcha with CTE, LIMIT, and RETURNING

41 namanyayg 26 5/4/2025, 1:24:14 AM shayon.dev ↗

Comments (26)

impulsivepuppet · 1h ago
Since I don't often write raw SQL, I can only assume the author named their CTE `deleted_tasks` to elucidate that the query might delete multiple items. Otherwise, it makes little sense, for they intended to "pop" a single row, and yet their aptly named `deleted_tasks` ended up removing more than one!

The query reads to me like a conceptual mish-mash. Without understanding what the innermost `SELECT` was meant to accomplish, I'd naturally interpret the `WHERE id IN (...)` as operating on a set. But the most sacrilegious aspect is the inclusion of `FOR UPDATE SKIP LOCKED`. It assumes a very specific execution order that the query syntax doesn't actually enforce.

Am I right to think that not avoiding lock contention, i.e. omitting `SKIP LOCKED`, would have actually produced the intended result?
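
For readers without the article open, the query under discussion has roughly the shape below. This is a reconstruction from the descriptions in this thread (the `deleted_tasks` CTE, `WHERE id IN (...)`, `LIMIT 1`, `FOR UPDATE SKIP LOCKED`, and a final select that just passes the rows through), so the article's exact text may differ.

```sql
-- Reconstructed sketch, not a verbatim copy of the article's query.
-- Assumes task_queue has id, queue_group_id and item_id columns.
WITH deleted_tasks AS (
  DELETE FROM task_queue
  WHERE id IN (
    SELECT id FROM task_queue
    WHERE queue_group_id = 15
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING item_id
)
SELECT item_id FROM deleted_tasks;
```

The intent is to pop one row, but nothing in this syntax forces the inner `SELECT` to run exactly once.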

eknkc · 4h ago
I don’t know about “gotcha”. This sounds like a full blown planner bug to me.
swid · 4h ago
Arguably there is no bug here, but the blog doesn't do a good job of explaining the issue. Their fix may work because they remove the CTE and switch to `=`, but they show they don't understand why it makes a difference when they suggest "using id IN (...) might still be necessary"; if you do that, the problem will return.

There are two factors here.

The subquery

    SELECT id FROM task_queue WHERE queue_group_id = 15 FOR UPDATE SKIP LOCKED LIMIT 1
can return different ids each time you run it. If it was ordered, then it would always return the same id and if postgres optimized in a way that it runs more than once it would just get the same result each time anyway.

Otherwise, you need to force postgres to evaluate your subquery exactly once by materializing it. There are different ways this might be accomplished - the blog post does this incidentally by using `=`. But it is not the only way to tell postgres that.

For instance, like this. But it is fragile - without AS MATERIALIZED, it could be run more than once.

    WITH candidate AS MATERIALIZED (
      SELECT id FROM task_queue WHERE queue_group_id = 15 FOR UPDATE SKIP LOCKED LIMIT 1
    )
    DELETE FROM task_queue t USING candidate c WHERE t.id = c.id
    RETURNING t.item_id;
clhodapp · 1h ago
Hmm, it seems like the subquery is getting re-run on every single row of the task queue table, which seems like a performance issue in addition to correctness.

Personally, if I were going to put any part of that query in a CTE, it would have been (just) the select, as I generally prefer CTEs to subqueries.

I'm really not sure what motivated the author to create the CTE in the first place, as the final select seems to be essentially the identity function.

Normal_gaussian · 12m ago
"I'm really not sure what motivated the author to create the CTE in the first place, as the final select seems to be essentially the identity function."

We're likely seeing a simplified example. All queries should be able to be embedded in a CTE without integrity loss.

McGlockenshire · 1h ago
> it seems like the subquery is getting re-run on every single row of the task queue table

Y'all might find the term "correlated subquery" helpful here.

When I want them, I always get them. Sometimes I get them when I don't want them. Sometimes it's even my fault, not the result of the query planner looking at me sideways.

wordofx · 6h ago
Why was the author not just doing returning *

No need for the CTE to select everything…

tomnipotent · 5h ago
Could be a generational thing. There was a time not every query planner would properly eliminate columns based on the final projection and any intermediary copies required more I/O if you included everything. Even now I still find myself following "best practices" for things that haven't been relevant in a decade.
simonw · 4h ago

  DELETE FROM task_queue
  WHERE id = ( -- Use '=' for a single expected ID
    SELECT id FROM task_queue
    WHERE queue_group_id = 15
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING item_id;
I don't understand where that item_id value comes from, since that's not a column that's mentioned anywhere else in the query.

I guess it must be an unmentioned column on that task_queue table?

dragonwriter · 4h ago
In DELETE ... RETURNING, RETURNING works like SELECT in a read query, so, yes, item_id is a column in task_queue and the result set is the item_id value of the row deleted.
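
A minimal sketch of that behaviour, using a throwaway table. The table definition is an assumption made for illustration; the thread only confirms that `task_queue` has `id`, `queue_group_id` and `item_id` columns.

```sql
-- Hypothetical minimal schema; column types are assumed.
CREATE TEMP TABLE task_queue (
  id             serial PRIMARY KEY,
  queue_group_id int NOT NULL,
  item_id        int NOT NULL
);

INSERT INTO task_queue (queue_group_id, item_id)
VALUES (15, 100), (15, 101);

-- RETURNING projects columns of each deleted row, like a SELECT list.
DELETE FROM task_queue WHERE id = 1 RETURNING item_id;

-- Any column or expression over the deleted row works, including *.
DELETE FROM task_queue WHERE id = 2 RETURNING *;
```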
swid · 5h ago
This is interesting - I don’t expect undefined behavior like this. Should the query delete more than one row or not?

The blog post doesn’t really give answers - it isn’t obvious to me that the second query can’t be executed in the exact same way. They even cop to this fact:

> This structure appears less ambiguous to the planner and typically encourages it to evaluate the subquery first.

So then - their bug still exists maybe?

I have thoughts - probably we do expect it never should delete multiple rows, but after considering the query plan, I can see it being ambiguous. But I would have expected Postgres to use only a single interpretation.

dankebitte · 5h ago
It's not undefined behavior, it's the fact that the uncorrelated subquery within the CTE doesn't specify an ordering, therefore it cannot be implicitly materialized / evaluated once. Postgres documentation is clear here [1]:

> If sorting is not chosen, the rows will be returned in an unspecified order.

The original query from TFA could've instead just had the uncorrelated subquery moved into a materialized CTE.

[1] https://www.postgresql.org/docs/current/queries-order.html

swid · 4h ago
I see what you're saying, but it is very subtle, wouldn’t you agree?

Under a different plan, the inner query would only be evaluated once; it is hard to mentally parse how it will first find the rows with the group id and then pass them into the subquery.

And still I am not sure how using a CTE or not in the manner in the post is supposed to avoid this issue, so now I’m a bit skeptical it does. I see how a sort would.

I hope if the sub query was its own CTE, the limit would be applied correctly, but am no longer sure… before this post I wouldn’t have questioned it.

Edit to say - now I see you need to explicitly use AS MATERIALIZED if you bump the subquery to a CTE. Someone should really write a better blog post on this… it raises an interesting case but fails to explain it… they probably have not even solved it for themselves.

porridgeraisin · 3h ago
It doesn't matter. The output of a query cannot depend on the plan. All plans should generate semantically equivalent output necessarily. Notice I say semantically equivalent because obviously Select random() can return different numbers each time. But it should still be semantically, one single random number.

In this case, the number of rows changing given the same input dataset, is a bug.

swid · 3h ago
It is allowed to be a planner choice though! Very surprising, but if you understand the docs, it will follow that it is not actually a bug. It could change in the future and has changed in the past apparently - but that change was to give the planner more opportunity to optimize queries by not materializing parts of the query and inlining that part into the parent.

Regarding selects: [0]: “A key property of WITH queries is that they are normally evaluated only once per execution of the primary query… However, a WITH query can be marked NOT MATERIALIZED to remove this guarantee. … By default, a side-effect-free WITH query is folded into the primary query if it is used exactly once in the primary query’s FROM clause.”

Regarding CTEs: [1]: “A useful property of WITH queries is that they are normally evaluated only once per execution of the parent query… However, the other side of this coin is that the optimizer is not able to push restrictions from the parent query down into a multiply-referenced WITH query"

Now, in either case - if you don't want the planner to inline the query - you might have to be explicit about it (since Postgres 12, which introduced CTE inlining and the `MATERIALIZED` keyword), or otherwise - yes, the output of the query will depend on the plan and this is allowed based on the docs.

[0]: https://www.postgresql.org/docs/current/sql-select.html

[1]: https://www.postgresql.org/docs/current/queries-with.html

swid · 2h ago
I wrote this to a deleted comment, but even if the CTE was materialized, the subquery of the CTE would still not be...

For instance, with the stand alone query:

  DELETE … WHERE id IN (
    SELECT id … LIMIT 1 FOR UPDATE SKIP LOCKED
  )
  
the planner is free to turn that IN ( subquery ) into a nested‐loop semi‐join, re-executing the subquery as many times as it deems optimal. Therefore it can delete more than 1 row.
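
One way to check which shape the planner picked is to look at the plan directly. The sketch below assumes the same `task_queue` table with `id` and `queue_group_id` columns; the plan you get will depend on your data, statistics and Postgres version, so no output is shown.

```sql
-- Plain EXPLAIN only plans the statement; it does not execute the DELETE.
EXPLAIN (COSTS OFF)
DELETE FROM task_queue
WHERE id IN (
  SELECT id FROM task_queue
  WHERE queue_group_id = 15
  LIMIT 1
  FOR UPDATE SKIP LOCKED
);

-- EXPLAIN ANALYZE really runs the DELETE, so wrap it in a transaction
-- and roll back. The "loops=" figure on the subquery's nodes shows how
-- many times that part of the plan was executed.
BEGIN;
EXPLAIN (ANALYZE, COSTS OFF)
DELETE FROM task_queue
WHERE id IN (
  SELECT id FROM task_queue
  WHERE queue_group_id = 15
  LIMIT 1
  FOR UPDATE SKIP LOCKED
);
ROLLBACK;
```

If the subquery side sits under a semi-join and reports more than one loop, it was re-executed.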
Horffupolde · 5h ago
IN is not recommended. Use = ANY instead.
codesnik · 5h ago
aren't they the same in Postgres?
frollogaston · 3h ago
Yes, according to the manual, IN is equivalent to = ANY() https://www.postgresql.org/docs/current/functions-subquery.h...

I had to check because for some reason, I always thought =ANY was somehow better than IN.

masklinn · 3h ago
It is, but in a pretty minor way: `= ANY` accepts an empty array, while `IN ()` with an empty list is a syntax error.
tczMUFlmoNk · 2h ago
ANY can be used with arrays, particularly query parameters: `id = ANY($1::int[])`, but not `id IN $1::int[]`.
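
A small sketch of both points, runnable without any tables. The prepared statement name `is_member` is made up for the example.

```sql
-- Array form: compares the left-hand value against each array element.
SELECT 2 = ANY(ARRAY[1, 2, 3]);

-- An empty array is accepted and simply matches nothing (evaluates to false).
SELECT 2 = ANY('{}'::int[]);

-- The empty IN list has no valid spelling; this line is a syntax error
-- if uncommented:
-- SELECT 2 IN ();

-- With a single array parameter, as a driver would bind it:
PREPARE is_member(int, int[]) AS SELECT $1 = ANY($2);
EXECUTE is_member(2, ARRAY[1, 2, 3]);
EXECUTE is_member(2, '{}');
DEALLOCATE is_member;
```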
abhisek · 6h ago
Interesting to think about how to guard against these cases where query optimisation leads to unexpected results.

I mean this could lead to serious bugs. What can be a way to detect these using linters or in CI before they hit the production.

magicalhippo · 5h ago
On a somewhat related note...

We've moved to MSSQL due to several reasons including customer demand.

We're experiencing the MSSQL query planner occasionally generating terrible plans for core queries which then gets cached, leading to support calls.

Workaround for now has been to have our query generator append a fixed-value column and change its value every 10 minutes, as a cache defeat.

Still, surprised the engine doesn't figure this out itself, like try regenerating plans frequently if they contain non-trivial table scans say.

Or just expire cache entries every 15 minutes or so, so a bad plan doesn't stick around for too long.

baq · 55m ago
It’s one of RDBMS peculiarities that you just need to learn about. Any half decent MSSQL course or book will tell you about the plan cache and the occasional gotcha exactly like yours. Time wasted debugging this issue by the unaware is probably in kilo man years by now.

…but MSSQL is still a fantastic database, if you can afford it. Postgres and mysql come with their own set of gotchas, some of which need an actually decent book to be explained. (Note all the RDBMS manuals are decent books and everyone without exception should read at least the TOC of the db they’re using, which IME is still a rarity.)

tucnak · 2h ago
DELETE in SKIP LOCKED scenario is recipe for failure. God/Postgres gave you transactions for atomicity... only UPDATE w/ SKIP LOCKED, holy brother in Christ, and only DELETE completed normally.
baq · 51m ago
God gave us Postgres to teach humanity that everything is an INSERT.