Show HN: PgDog – Shard Postgres without extensions
Here’s a walkthrough of how it works: https://www.youtube.com/watch?v=y6sebczWZ-c
Running Postgres at scale is hard. Eventually, one primary isn’t enough, at which point you need to split it up. Since there is currently no good tooling out there to do this, teams end up breaking their apps apart instead.
If you’re familiar with PgCat, my previous project, PgDog is its spiritual successor but with a fresh codebase and new goals. If not, PgCat is a pooler for Postgres also written in Rust.
So, what’s changed and why a new project? Cross-shard queries are supported out of the box. The new architecture is more flexible, completely asynchronous and supports manipulating the Postgres protocol at any stage of query execution. (Oh, and you guessed it — I adopted a dog. Still a cat person though!)
Not everything is working yet, but simple aggregates like max(), min(), count(*) and sum() are in. More complex functions like percentiles and average will require a bit more work. Sorting (i.e. ORDER BY) works, as long as the values are part of the result set, e.g.:
SELECT id, email FROM users
WHERE admin = true
ORDER BY 1 DESC;
PgDog buffers and sorts the rows in memory, before sending them to the client. Most of the time, the working set is small, so this is fine. For larger results, we need to build swap to disk, just like Postgres does, but for OLTP workloads, which PgDog is targeting, we want to keep things fast. Sorting currently works for bigint, integer, and text/varchar. It’s pretty straightforward to add all the other data types, I just need to find the time and make sure to handle binary encoding correctly.
All standard Postgres features work as normal for unsharded and direct-to-shard queries. As long as you include the sharding key (a column like customer_id, for example) in your query, you won’t notice a difference.
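To make the cross-shard aggregate and ORDER BY behavior described above concrete, here is a minimal Python sketch of the merge step a proxy has to perform. It is an illustration only, not PgDog's code (PgDog is written in Rust); the names merge_aggregates and merge_sorted and the sample rows are hypothetical, and NULL handling is ignored.

```python
# How each simple aggregate combines per-shard partial values.
# Note that a global count(*) is the SUM of the per-shard counts.
COMBINE = {"count": sum, "sum": sum, "max": max, "min": min}


def merge_aggregates(functions, shard_rows):
    """Combine one aggregate row per shard into a single result row.

    functions  -- aggregate name for each column, e.g. ["count", "max"]
    shard_rows -- one tuple per shard, columns in the same order
    """
    columns = zip(*shard_rows)  # regroup values column by column
    return tuple(COMBINE[name](values) for name, values in zip(functions, columns))


def merge_sorted(shard_results, column, descending=False):
    """Buffer every shard's rows in memory, then sort on one column.

    column is 1-based, mirroring ORDER BY 1 in SQL. The sort column must
    be part of the result set, otherwise there is nothing to sort on.
    """
    rows = [row for shard in shard_results for row in shard]
    return sorted(rows, key=lambda row: row[column - 1], reverse=descending)


# Two shards answering: SELECT count(*), max(id), min(id), sum(amount) ...
print(merge_aggregates(["count", "max", "min", "sum"],
                       [(3, 10, 1, 60), (2, 42, 7, 40)]))
# Two shards answering: SELECT id, email FROM users ... ORDER BY 1 DESC
print(merge_sorted([[(1, "a@x"), (5, "b@x")], [(3, "c@x")]], 1, descending=True))
```

avg() does not fit this table because the average of per-shard averages is generally not the overall average. It needs a sum and a count from every shard, which is one reason query rewriting matters for the more complex aggregates.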
How does this compare to Citus? In case you’re not familiar, Citus is an open source extension for sharding Postgres. It runs inside a single Postgres node (a coordinator) and distributes queries between worker databases.
PgDog’s architecture is fundamentally different. It runs outside the DB: it’s a proxy, so you can deploy it anywhere, including managed Postgres like RDS, Cloud SQL and others where Citus isn’t available. It’s multi-threaded and asynchronous, so it can handle thousands, if not millions, of concurrent connections. Its focus is OLTP, not OLAP. Meanwhile, Citus is more mature and has good support for cross-shard queries and aggregates. It will take PgDog a while to catch up.
My Rust has improved since my last attempt at this and I learned how to use the bytes crate correctly. PgDog does almost zero memory allocations per request. That results in a 3-5% performance increase over PgCat and a much more consistent p95. If you’re obsessed with performance like me, you know that small percentage is nothing to sneeze at. Like before, multi-threaded Tokio-powered PgDog leaves the single-threaded PgBouncer in the dust (https://pgdog.dev/blog/pgbouncer-vs-pgdog).
Since we’re using pg_query (which itself bundles the Postgres parser), PgDog can understand all Postgres queries. This is important because we can not only correctly extract the WHERE clause and INSERT parameters for automatic routing, but also rewrite queries. This will be pretty useful when we add support for more complex aggregates, like avg(), and cross-shard joins!
Read/write traffic split is supported out of the box, so you can put PgDog in front of the whole cluster and ditch the code annotations. It’s also a load balancer, so you can deploy it in front of multiple replicas to get four 9s of uptime.
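As a rough illustration of what a read/write split plus load balancing means, here is a small Python sketch. It is not how PgDog decides: PgDog uses the real Postgres parser, whereas this stand-in only looks at the first keyword, so it would misroute statements such as SELECT ... FOR UPDATE or a SELECT that calls a function with side effects. The Router class and the server names are hypothetical.

```python
import itertools


class Router:
    """Send reads to replicas in round-robin order, everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # cycle() yields replicas in order forever: a simple round-robin balancer.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql, in_transaction=False):
        words = sql.split()
        first_word = words[0].upper() if words else ""
        # Assumption: inside an explicit transaction, stay on the primary so
        # the client reads its own writes.
        is_read = first_word == "SELECT" and not in_transaction
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = Router("primary", ["replica1", "replica2"])
for query in ["SELECT 1", "SELECT 2", "UPDATE users SET admin = false"]:
    print(query, "->", router.route(query))
```

The application connects to one endpoint and never labels queries as reads or writes, which is what removes the need for code annotations.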
One of the coolest features so far, in my opinion, is distributed COPY. This works by hacking the Postgres network protocol and sending individual rows to different shards (https://pgdog.dev/blog/hacking-postgres-wire-protocol). You can just use it without thinking about cluster topology, e.g.:
COPY temperature_records (sensor_uuid, created_at, value)
FROM STDIN CSV;
The sharding function is straight out of Postgres partitions and supports uuid v4 and bigint. Technically, it works with any data type, but I just haven’t added all the wrappers yet. Let me know if you need one.
What else? Since we have the Postgres parser handy, we can inspect, block and rewrite queries. One feature I was playing with is ensuring that the app is passing in the customer_id in all queries, to avoid data leaks between tenants. Brain dump of that in my blog here: https://pgdog.dev/blog/multi-tenant-pg-can-be-easy.
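The idea behind distributed COPY, splitting one incoming stream of rows into per-shard streams, can be sketched in a few lines of Python. This is a simplified illustration under loud assumptions: it works on parsed CSV text rather than on the Postgres wire protocol, and shard_for uses SHA-256 as a stand-in hash. It is NOT the Postgres partition hash function that PgDog uses, so it would place rows on different shards than PgDog does. The function names are hypothetical.

```python
import csv
import hashlib
import io


def shard_for(key, shards):
    """Map a sharding key to a shard number.

    Stand-in hash for illustration only; not the Postgres partition hash.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shards


def split_copy(csv_data, key_column, shards):
    """Group CSV rows by destination shard, preserving input order per shard."""
    buckets = {n: [] for n in range(shards)}
    for row in csv.reader(io.StringIO(csv_data)):
        if row:  # skip blank lines
            buckets[shard_for(row[key_column], shards)].append(row)
    return buckets


data = "a1,2025-01-01,20.5\nb2,2025-01-01,21.0\na1,2025-01-02,19.0\n"
for shard, rows in split_copy(data, key_column=0, shards=4).items():
    print(shard, rows)
```

Rows with the same sharding key always land on the same shard, which is the property that lets direct-to-shard queries find them later.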
What’s on the roadmap: (re)sharding Postgres using logical replication, so we can scale DBs without taking downtime. There is a neat trick on how to quickly do this on copy-on-write filesystems (like EBS used by RDS, Google Cloud volumes, ZFS, etc.). I’ll publish a blog post on this soon. More at-scale features like blocking bad queries and just general “I wish my Postgres proxy could do this” stuff. Speaking of which, if you can think of any more features you’d want, get in touch. Your wishlist can become my roadmap.
PgDog is being built in the open. If you have thoughts or suggestions about this topic, I would love to hear them. Happy to listen to your battle stories with Postgres as well.
Happy hacking!
Lev
Such a cool project, good job Lev!
I've been looking into PgDog for sharding a 40TB Postgres database atm vs building something ourselves. This could be a good opportunity to collaborate because what we need is something more like Vitess for PostgreSQL. The scatter gather stuff is great but what we really need is config management via something like etcd, shard splitting, best-effort transactions for doing schema changes across all shards etc.
Almost totally unrelated but have you had good success using pg_query.rs to re-write queries? Maybe I misunderstood how pg_query.rs works but re-writing an AST seems like a nightmare with how the AST types don't really support mutability or deep cloning. I ended up using the sqlparser crate which supports mutability via Visitors. I have a side project I'm chipping away at to build online schema change for PG using shadow tables and logical replication ala gh-ost.
Jake
I would love to collaborate. Email me: lev@pgdog.dev. Config management is a solved problem, we can use K8s or any number of CD tools. PgDog config reloading can be synchronized.
Best effort transactions for schema changes across shards are working today. Ideally, schema changes are idempotent so it's safe to retry in case of failure. Otherwise, we can try 2-phase commit. It'll need a bit of management to make sure they are not left uncommitted (they block vacuum).
Shard splitting can be done with logical replication. I've done this at Instacart with 10TB+ databases. At that scale, you need to snapshot it with a replication slot open, restore to N instances, delete whatever data doesn't match the shard #, and re-sync with logical replication. Another thing I wanted to try was using Pg 17 logical replication from streaming replicas. I feel like it's possible to parallelize resharding with like 16 replicas, without affecting the primary. In that situation, it might be feasible to just COPY tables through foreign tables with postgres_fdw or PgDog (my choice of sharding function was very intentional). Something to consider.
pg_query.rs seems to be mutable now, as far as I can tell. I've been rewriting and generating brand new queries. I like that it's 100% Postgres parser. That's super important. They have the "deparse" method (struct -> SQL) on pretty much every NodeEnum, so I think it's good to go for doing more complex things.
Lev
I would like to implement cross-shard unique indexes, but they are expensive to check for every query. Open to ideas!
I don’t know that I’d want my sharding to be so transparently handled / abstracted away. First, because usually sharding is on the tenancy boundary and I’d want friction on breaking this boundary. Second, because the implications of joining across shards are not the same as in-shard (performance, memory, cpu) and I’d want to make that explicit too
That takes nothing out of this project, it’s really impressive stuff and there’s tons of use cases for it!
> I’d want friction on breaking this boundary
Why do you want friction?
> implications of joining across shards are not the same
That's usually well understood and can be tracked with real time metrics. Ultimately, both are necessary and alternative solutions, like joining in the app code, are not great.
Because 99% of the time, breaking the tenancy boundary is not the right thing to do. Most likely it's a sign that the tenant ID has been lost along the way, and that it should be fixed. Or that the use-case is shady and should be thought about more carefully ("what are you _actually_ trying to do" type of thing).
A tenet I try to stick to is "make the right thing look different (and be easier) than the wrong thing": in this case I think that breaking the tenancy boundary should be explicit and more difficult than respecting it (i.e. sticking to one shard).
That's of course on the assumption that cross-shard queries mean (potentially) cross-tenancy, and that this isn't something that's usually desirable. That's the case in the apps I tend to work on (SaaS) but isn't always the case.
> That's usually well understood
By who? Certainly wouldn't be well-understood by the average dev in the average SaaS company I don't think! Especially if normal joins and cross-shard joins look the exact same, I don't think 90% of devs would even think about it (or know they should think about it).
---
This sounds like negative feedback: it's not! I fully believe that this is a really good tool, I'm really happy it exists and I'll absolutely keep it in my back pocket. I'm saying that the ergonomics of it aren't what I'd (ideally) want for the projects I work on professionally
> Why do you want friction?
Probably because it makes accidental or malicious attempts to leak among tenants harder, therefore less likely.
I think there are a few good solutions for multi-tenant safety. We just need ergonomic wrappers at the DB layer to make them easy to use.
For me the key point in such projects is always handling of distributed queries. It's exciting that PgDog tries to stay transparent/compatible while operating on the network layer.
Of course the limitations that are mentioned in the docs are expected and will require trade-offs. I'm very curious to see how you will handle this. If there is any ongoing discussion on the topic, I'd be happy to follow and maybe even share ideas.
Good luck!
Congrats on the launch Lev, and keep it up!
The benchmarks presented only seem to address standard pooling, I'd like to see what it looks like once query parsing and cross-shard join come into play.
Cross-shard joins will be interesting. I suspect the biggest cost there will be executing suboptimal queries without a good join filter: that's what it takes to compute a correct result sometimes. Also, the result set could get quite large, so we may need to page to disk.
I'm focusing on OLTP use cases first with joins pushed down as much as possible. I suspect cross-shard joins will be requested soon afterwards.
If not, what is the approach to enable restarts without downtime (let's say one node crashes)?
1. pause traffic 2. reload config 3. resume traffic
This can be done in under a second.
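The pause, reload, resume sequence above can be pictured with a small asyncio sketch. This is a hypothetical illustration of the general technique, not PgDog's implementation: the Proxy class, its methods and the config dictionary are all made up for the example, and a real proxy would also wait for in-flight queries to finish before swapping the config.

```python
import asyncio


class Proxy:
    """Toy proxy that can hold client queries while its config is swapped."""

    def __init__(self, config):
        self.config = config
        self._open = asyncio.Event()
        self._open.set()  # traffic flows by default

    def pause(self):
        self._open.clear()  # new queries now wait instead of failing

    def resume(self):
        self._open.set()  # release every waiting query

    async def handle(self, query):
        await self._open.wait()  # blocks only while paused
        return f"{query} -> {self.config['primary']}"


async def demo():
    proxy = Proxy({"primary": "db-old"})
    proxy.pause()                                   # 1. pause traffic
    pending = asyncio.create_task(proxy.handle("SELECT 1"))
    await asyncio.sleep(0)                          # let the query reach the gate
    proxy.config = {"primary": "db-new"}            # 2. reload config
    proxy.resume()                                  # 3. resume traffic
    return await pending


print(asyncio.run(demo()))
```

Clients see a short delay rather than an error, because their queries are queued during the pause and then run against the new configuration.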
Restarts without downtime can be handled with blue/green deployments using a TCP load balancer or DNS.
Hot shard management is a job in and of itself and adds a lot of operational complexity.
I think the first step is to add as much monitoring as possible to catch the problem early. There is also the dry-run mode [1] that you can try before sharding at all to make sure your traffic would be evenly split given a choice of sharding key.
[1] https://pgdog.dev/blog/sharding-a-real-rails-app#dry-run-mod...
What’s the long term (business) plan to keep it updated?
Business plan is managed deployments and support, pretty standard for an infra product I believe.
That said, you probably aren't the first person to ask; so are there similar projects that don't meet all of your criteria or you've not seen anything in that space?
Also, MSSQL is clearly in maintenance mode. Microsoft continues to support it and sell it because $$$$ but it's not a focus.
Unfortunately this advice is incompatible with that of most legal departments.
I get that this is your interpretation, but your interpretation doesn't have any value when it comes to possible IP issues.
lev@pgdog.dev
I see this comment pop up every now and then on HN specifically, but I've never personally had a lawyer tell me this; is there any chance anyone could share an actual example of this?
https://en.wiktionary.org/wiki/pig_dog
GALAHAD: What a strange person.