DELETE FROM task_queue
WHERE id = ( -- Use '=' for a single expected ID
SELECT id FROM task_queue
WHERE queue_group_id = 15
LIMIT 1
FOR UPDATE SKIP LOCKED
)
RETURNING item_id;
I don't understand where that item_id value comes from, since that's not a column that's mentioned anywhere else in the query.
I guess it must be an unmentioned column on that task_queue table?
dragonwriter · 18m ago
In DELETE ... RETURNING, RETURNING works like the SELECT list of a read query, so, yes, item_id is a column in task_queue, and the result set is the item_id value of the deleted row.
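As a minimal sketch (schema reconstructed from the column names mentioned in the thread, not from the article):

-- RETURNING projects from the deleted rows just like a SELECT list:
DELETE FROM task_queue
WHERE id = 42                -- hypothetical id
RETURNING id, item_id;       -- any columns (or expressions) of the deleted row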
eknkc · 14m ago
I don’t know about “gotcha”. This sounds like a full-blown planner bug to me.
wordofx · 2h ago
Why was the author not just doing RETURNING *?
No need for the CTE to select everything…
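Presumably meaning something like this (a sketch based on the query at the top of the thread; untested):

DELETE FROM task_queue
WHERE id = (
    SELECT id FROM task_queue
    WHERE queue_group_id = 15
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;   -- every column of the deleted row, no explicit list needed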
tomnipotent · 1h ago
Could be a generational thing. There was a time not every query planner would properly eliminate columns based on the final projection and any intermediary copies required more I/O if you included everything. Even now I still find myself following "best practices" for things that haven't been relevant in a decade.
swid · 1h ago
This is interesting - I don’t expect undefined behavior like this. Should the query delete more than one row or not?
The blog post doesn’t really give answers - it isn’t obvious to me that the second query can’t be executed in the exact same way. They even cop to this fact:
> This structure appears less ambiguous to the planner and typically encourages it to evaluate the subquery first.
So then - their bug still exists maybe?
I have thoughts - probably we do expect it should never delete multiple rows, but after considering the query plan, I can see it being ambiguous. But I would have expected Postgres to settle on a single interpretation.
dankebitte · 1h ago
It's not undefined behavior, it's the fact that the uncorrelated subquery within the CTE doesn't specify an ordering, therefore it cannot be implicitly materialized / evaluated once. Postgres documentation is clear here [1]:
> If sorting is not chosen, the rows will be returned in an unspecified order.
The original query from TFA could've instead just had the uncorrelated subquery moved into a materialized CTE.

[1] https://www.postgresql.org/docs/current/queries-order.html
I see what you’re saying, but it is very subtle, wouldn’t you agree?
Under a different plan, the inner query would only be evaluated once; it is hard to mentally parse that it will first find the rows with the group id and then pass them into the subquery.
And still I am not sure how using a CTE or not in the manner in the post is supposed to avoid this issue, so now I’m a bit skeptical it does. I see how a sort would.
I’d hope that if the subquery were its own CTE, the limit would be applied correctly, but I’m no longer sure… before this post I wouldn’t have questioned it.
Edit to say - now I see you need to explicitly use AS MATERIALIZED if you bump the subquery to a CTE (see the sketch below). Someone should really write a better blog post on this… it raises an interesting case but fails to explain it… they probably haven’t even solved it for themselves.
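For concreteness, a sketch of that materialized-CTE variant (untested; AS MATERIALIZED is available from Postgres 12 and forces the CTE to be evaluated once rather than inlined):

WITH picked AS MATERIALIZED (
    SELECT id FROM task_queue
    WHERE queue_group_id = 15
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
DELETE FROM task_queue
WHERE id = (SELECT id FROM picked)  -- picked holds at most one locked id
RETURNING item_id;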
Horffupolde · 1h ago
IN is not recommended. Use = ANY instead.
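For reference, the two spellings (sketch; one practical difference is that = ANY also accepts a single array parameter, which is handy with drivers):

SELECT * FROM task_queue WHERE id IN (1, 2, 3);
SELECT * FROM task_queue WHERE id = ANY (ARRAY[1, 2, 3]);  -- or = ANY($1) with an array bind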
codesnik · 57m ago
aren't they the same in Postgres?
abhisek · 2h ago
Interesting to think about how to guard against these cases where query optimisation leads to unexpected results.
I mean this could lead to serious bugs. What would be a way to detect these using linters or in CI before they hit production?
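One hand-rolled option: run EXPLAIN against a seeded test database in CI and fail the build when the plan shape regresses. A sketch (the exact node names depend on the plan Postgres actually picks):

EXPLAIN (FORMAT JSON)
DELETE FROM task_queue
WHERE id = (
    SELECT id FROM task_queue
    WHERE queue_group_id = 15
    LIMIT 1
    FOR UPDATE SKIP LOCKED
);
-- A CI script could parse the JSON and assert, for example, that the
-- subquery appears as a single InitPlan rather than a per-row SubPlan.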
magicalhippo · 1h ago
On a somewhat related note...
We've moved to MSSQL for several reasons, including customer demand.
We're experiencing the MSSQL query planner occasionally generating terrible plans for core queries, which then get cached, leading to support calls.
Workaround for now has been to have our query generator append a fixed-value column and change its value every 10 minutes, as a cache defeat.
Still, surprised the engine doesn't figure this out itself, like regenerating plans frequently if they contain non-trivial table scans, say.
Or just expire cache entries every 15 minutes or so, so a bad plan doesn't stick around for too long.
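For concreteness, the cache-defeat trick presumably looks something like this (hypothetical table and column names):

SELECT o.*,
       1718000000 AS plan_cache_buster  -- literal rotated every ~10 minutes by the query generator,
                                        -- changing the query text and therefore the plan cache key
FROM orders AS o
WHERE o.customer_id = 42;

SQL Server's OPTION (RECOMPILE) hint is the built-in version of the same idea, at the cost of compiling a fresh plan on every execution.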