Love this style of no-fluff technical deep dive. HN needs more content like this.
otterley · 2h ago
> To be an effective alternative to Elasticsearch we needed to support high ingest workloads in real time.
Why not just use OpenSearch or ElasticSearch? The tool is already in the inventory; why use a screwdriver when a chisel is needed and available?
This is another one of those “when you have a hammer, everything looks like your thumb” stories.
iw2rmb · 2h ago
Because you don’t need to sync and you have ACID with joins.
otterley · 2h ago
Is there a whole business to be had with those advantages alone? I’m curious as to who the target market is.
levkk · 1h ago
My last big co, we had a team of 10 whose entire job was to sync data from Postgres into Elastic. It would take weeks and fall over regularly due to traffic.
If we could have a DB that could do search and be a store of record, it would be amazing.
mathewpregasen · 1h ago
Yeah, in general, I think a lot of businesses would love to skip ETL pipelines if possible / consolidate data. Postgres is very much a neutral database to extend upon; maybe a wild analogy, but it's the canola oil of databases.
strbean · 35m ago
Total tangent, but I think "Canola is a neutral oil" is a lie. It's got the most distinctive (and in my opinion, bad) flavor of the common cooking oils.
retakeming · 29m ago
What would you say is the most neutral oil then?
gtowey · 41m ago
It's not going to happen anytime soon, because you simply cannot cheat physics.
A system that supports OLAP/ad-hoc queries is going to need a ton of IOPS and probably also CPU capacity to do your data transformations. If you want this to also scale beyond the capacity limits of a single node, then you're going to run into distributed joins, and the network becomes a huge factor.
Now, to support OLTP at the same time, your big, distributed system needs to support ACID, be highly fault-tolerant, etc.
All you end up with is a system that has to be scaled in every dimension. It needs to support the maximum possible workloads you can throw at it, or else a random, expensive reporting query is going to DoS your system and your primary customer-facing system will be unusable at the same time. It is sort of possible, but it's going to cost A LOT of money. You have to have tons and tons of "spare" capacity.
Which brings us to the core of engineering -- anyone can build a system that burns dump trucks full of venture capital dollars to create the one-system-to-rule-them-all. But businesses that want to succeed need to optimize their costs so their storage systems don't break the bank. This is why the current status quo of specialized systems that do one task well isn't going to change. The current technology paradigm cannot be optimized for every task simultaneously. We have to make tradeoffs.
otterley · 1h ago
They're different access patterns, though. Are there no concerns about performance and potentially blocking behavior? Decoupling OLTP and analytics is frequently done with good reason: 1/ to allow the systems to scale independently, and 2/ to help prevent issues with one component from impacting the other (i.e., contain blast radius). I wouldn't want a failure of my search engine to also take down my transaction system.
philippemnoel · 44m ago
You don't need to. Customers usually deploy us on one or more standalone replicas in their Postgres cluster. If a query were to take one down, it would only take down the replicas dedicated to ParadeDB, leaving the primary and all other read replicas dedicated to OLTP safe.
otterley · 22m ago
Are you saying that the cluster isn't homogeneous? It sounds like you're describing an architecture that involves a cluster with two entirely different pieces of software on it, whose roles aren't interchangeable.
philippemnoel · 2m ago
Bear with me, this will be a bit of a longer answer. Today, there are two topologies under which people deploy ParadeDB.
- <some managed Postgres service> + ParadeDB. Frequently, customers already use a managed Postgres (e.g. AWS RDS) and want ParadeDB. In that world, they maintain their managed Postgres service and deploy a Kubernetes cluster running ParadeDB on the side, with one primary instance and some number of replicas. The AWS RDS primary sends data to the ParadeDB primary via logical replication. You can see a diagram here: https://docs.paradedb.com/deploy/byoc
In this topology, the OLTP and search/OLAP workloads are fully isolated from each other. You have two clusters, but you don't need a third-party ETL service since they're both "just Postgres". (The replication wiring for this setup is sketched after this list.)
- <self-hosted Postgres> + ParadeDB. Some customers, typically larger ones, prefer to self-host Postgres and want to install our Postgres extension directly. The extension is installed in their primary Postgres, and the CREATE INDEX commands must be issued on the primary; however, they may route reads only to a subset of the read replicas in their cluster.
In this topology, all writes could be directed to the primary, all OLTP read queries could be routed to a pool of read replicas, and all search/OLAP queries could be directed to another subset of replicas.
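Concretely, the wiring for the first topology is just stock Postgres logical replication. A minimal sketch, with a made-up documents table and connection string (not from our docs):

  -- On the source primary (e.g. RDS); the source needs logical replication enabled
  -- (wal_level = logical). "documents" is a hypothetical example table.
  CREATE PUBLICATION paradedb_pub FOR TABLE documents;

  -- On the ParadeDB primary: create the same table, then subscribe.
  CREATE TABLE documents (
      id         bigint PRIMARY KEY,
      title      text,
      body       text,
      updated_at timestamptz
  );

  CREATE SUBSCRIPTION paradedb_sub
      CONNECTION 'host=my-rds-host.example.com dbname=app user=replicator'
      PUBLICATION paradedb_pub;

  -- Search/OLAP indexes (e.g. pg_search's BM25 index) are then built on the
  -- replicated tables on the ParadeDB side only.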
Both are completely reasonable approaches and depend on the workload. Hope this helps :)
mikegreenberg · 1h ago
Once upon a time, I was using postgres for OLTP and OLAP purposes combined with in-database transforms using TimescaleDB. I had a schema for optimized ingestion and then several aggregate views which produced a bunch of purpose-specific "materialized" tables for efficient analysis based on the ingestion tables.
Timescale had a nice way of abstracting away the cost of updating these views without putting too much load on ingestion (processing multiple TBs of data at a time in a single instance with about 500GB of data churn daily).
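Roughly, the pattern with Timescale's continuous aggregates looks like this (the schema is a made-up stand-in, not my actual ingestion tables):

  -- Hypothetical ingestion hypertable.
  CREATE TABLE readings (
      ts     timestamptz NOT NULL,
      device int         NOT NULL,
      value  double precision
  );
  SELECT create_hypertable('readings', 'ts');

  -- "Materialized" rollup that Timescale maintains incrementally.
  CREATE MATERIALIZED VIEW readings_hourly
  WITH (timescaledb.continuous) AS
  SELECT time_bucket('1 hour', ts) AS bucket,
         device,
         avg(value) AS avg_value,
         count(*)   AS n
  FROM readings
  GROUP BY bucket, device;

  -- Refresh in the background so ingestion isn't blocked.
  SELECT add_continuous_aggregate_policy('readings_hourly',
      start_offset      => INTERVAL '3 hours',
      end_offset        => INTERVAL '1 hour',
      schedule_interval => INTERVAL '30 minutes');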
worldsayshi · 1h ago
One db that could be interesting here is CrateDB. It's a Lucene-based DB that supports the Postgres wire protocol, so you can run SQL queries against it.
I've tried figuring out if it supports acting as a pg read-replica, which sounds to me like the ideal setup - but it doesn't seem to be supported.
I have no affiliation to them, just met the team at an event and thought it sounded cool.
philippemnoel · 54m ago
One of the ParadeDB maintainers here -- Being PostgreSQL wire protocol compatible is very different from being built inside Postgres on top of the Postgres pages, which is what ParadeDB does. You still need the "T" in ETL, i.e. transforming data from your source into the format of the sink (in your example, CrateDB). This is where ETL costs and brittleness come into play. You can read more about it here: https://www.paradedb.com/blog/block_storage_part_one
Interestingly enough, it looks like the team was just hacking on an open source extension and organically attracted some customers, which snowballed into raising capital. So definitely seems like there’s a market.
cryptonector · 1h ago
For JOINs? Absolutely! Who wants to hand-code queries at the executor level?! It's expensive!
You need a query language.
You don't necessarily need ACID, and you don't necessarily need a bunch of things that SQL RDBMSes give you, but you definitely need a QL, and it has to support a lot of what SQL supports, especially JOINs and GROUP BY w/ aggregations.
NoSQLs tend to evolve into having a QL layered on top. Just start with that if you really want to build a NoSQL.
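For a sense of the minimum bar, something like the following has to be expressible directly in the QL (orders/customers here are hypothetical tables, just for illustration):

  -- A join plus grouped aggregation: the baseline any QL needs to cover.
  SELECT c.region,
         count(*)      AS orders,
         sum(o.amount) AS revenue
  FROM orders o
  JOIN customers c ON c.id = o.customer_id
  WHERE o.created_at >= now() - interval '30 days'
  GROUP BY c.region;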
otterley · 1h ago
To be clear here, I'm not arguing that OpenSearch/ElasticSearch is an adequate substitute for Postgres. They're different databases, each with different strengths and weaknesses. If you need JOINs and ACID compliance, you should use Postgres. And if you need distributed search, you should use OpenSearch/ElasticSearch.
Unless they're building for single-host scale, you're not going to get JOINs for free. Lucene (the engine upon which ES/OS is based) already has JOIN capability. But it's not used in ES/OS because the performance of JOINs is absolutely abysmal in distributed databases.
cryptonector · 13m ago
I'm arguing that sometimes you don't need ACID -- or rather, sometimes ACID is too painful, so you accept not having it -- but no one ever really doesn't want a QL; they only think they don't want one until they learn better.
I.e., NoACID does not imply NoQueryLanguage, and you can always have a QL, so you should always get a QL, and you should always use a QL.
> Unless they're building for single-host scale, you're not going to get JOINs for free.
If by 'free' you mean not having to code them, then that's wrong. You can always have or implement a QL.
If by 'free' you mean 'performant', then yes, you might have to denormalize your data so that JOINs vanish, though at the cost of write amplification. But so what, that's true whether you use a QL or not -- it's true in SQL RDBMSes too.
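A tiny illustration of that tradeoff, with a hypothetical orders/customers schema:

  -- Normalized: changing a customer's region is one row update,
  -- but reporting needs a JOIN.
  UPDATE customers SET region = 'EMEA' WHERE id = 42;

  -- Denormalized: the region is copied onto every order row,
  -- so the JOIN vanishes at read time...
  SELECT region, sum(amount) FROM orders_denorm GROUP BY region;

  -- ...but the same change now rewrites every order for that customer
  -- (write amplification).
  UPDATE orders_denorm SET region = 'EMEA' WHERE customer_id = 42;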
philippemnoel · 45m ago
Our customers typically deploy ParadeDB in a primary-replicas topology, with one primary Postgres node and 2 or more read replicas, depending on read volume. Queries are executed on a single node today, yes.
We have plans to eventually support distributed queries.
strbean · 30m ago
Obligatory whine that the term NoSQL got co-opted to mean "non-relational". There's tons of space for a better query language for querying relational databases.
derefr · 34m ago
It's funny; as someone who is exactly pg_search's market, I actually often want the opposite: ACID, MVCC transactions, automatic table and index management... but no query language.
At the data scale + level of complexity our OLAP queries operate at, we very often run into situations where Postgres's very best plan [with a well-considered schema, with great indexes and statistics, and after tons of tuning and coaxing], still does something literally interminable — not for any semantic reason to do with the query plan, but rather due to how Postgres's architecture executes the query plan[1].
The last such job, I thought, would be simple enough to run in a few hours... I let it run for six days[2], and then gave up and killed it. Whereas, when we encoded the same "query plan" as a series of bulk-primitive ETL steps by:
1. dumping the raw source data from PG to CSV with a `COPY`,
2. whipping out simple POSIX CLI tools like sort/uniq/grep/awk (plus a few hand-rolled streaming aggregation scripts) to transform/reduce/normalize the source data into the shape we want it in,
3. and then loading the resulting CSVs back into PG with another `COPY`,
...then the runtime of the whole operation was reduced to just a few hours, with the individual steps completing in ~30 minutes each. (And that's despite the overhead of parsing and/or emitting non-string fields from/to CSV with almost every intermediate step!)
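In psql terms, the shape of that pipeline was roughly the following (file names and the staging table are illustrative; step 2 runs entirely outside Postgres; source1/source2 are the anonymized tables from footnote [2]):

  -- 1. Dump the union of the two sources to CSV.
  \copy (SELECT a, b, value FROM source1 UNION ALL SELECT a, b, value FROM source2) TO 'u.csv' WITH (FORMAT csv)

  -- 2. Aggregate outside Postgres, e.g.:
  --      sort -t, -k1,1n -k2,2n u.csv | <streaming per-(a,b) sum> > totals.csv

  -- 3. Load the result back.
  CREATE TABLE totals (a integer, b integer, total_value bigint);
  \copy totals FROM 'totals.csv' WITH (FORMAT csv)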
Honestly, if Postgres would just let us program it the way one programs e.g. Redis through Lua, or ETS tables in Erlang — where the tables and indices are ADTs with low-level public APIs, and you set up your own "query plan" as a set of streaming-channel actors making calls to these APIs — then we would be a lot happier. But even in PL/pgSQL (which we do use, here and there), the only APIs are high-level ones.
• Sure, you can get a cursor on a query; but you can't e.g. get an LMDB-like B-tree cursor on a target B-tree index, and ask it to jump [i.e. re-nav down from root] or walk [i.e. nav up from current pos to nearest common ancestor then back down] to "the first row-tuple greater-than-or-equal to [key]".
• You can't write your own efficient implementation of TABLESAMPLE semantics to set up your own Bigtable-esque balanced cluster-order-partitioned parallel seq scan.
• You can't collect pointers to row-tuples, partially materialize them, filter them by some criterion on the read (but perhaps not parsed!) columns, and then more-fully materialize those same row-tuples "directly" from the references to them you still hold.
---
[1] One example of what I mean by "execution": did you know that Postgres doesn't use any form of concurrency for query plans — not even the most basic libuv-like "This Merge Append node's child-node A is in a blocking-wait on IO; that blocking-wait should yield, so that the Merge Append node's child-node B can instead send row-tuple batches for a while" kind of concurrency?
---
[2] If you're wondering, the query that ran for six days was literally just this (anonymized):
  SELECT a, b, SUM(value) AS total_value
  FROM (
    SELECT a, b, value FROM source1
    UNION ALL
    SELECT a, b, value FROM source2
  ) AS u
  GROUP BY a, b;
`source1` and `source2` are ~150GB tables. (Or at least, they're 150GB when dumped to CSV.) Two integer keys (a,b), and a bigint value. With a b-tree index on `(a,b) INCLUDE (value)`, with correct statistics.
And its EXPLAIN query plan looked like this (with `SET enable_hashagg = OFF;`) — nominally pretty good:
  GroupAggregate  (cost=1.17..709462419.92 rows=40000 width=40)
    Group Key: a, b
    ->  Merge Append  (cost=1.17..659276497.84 rows=6691282944 width=16)
          Sort Key: a, b
          ->  Index Only Scan using source1_a_b_idx on source1  (cost=0.58..162356175.31 rows=3345641472 width=16)
          ->  Index Only Scan using source2_a_b_idx on source2  (cost=0.58..162356175.31 rows=3345641472 width=16)
Each one of the operations here is "obvious." It's what you'd think you'd want! You'd think this would finish quickly. And yet.
(And no, the machine it ran on was not resource-bottlenecked. It had 1TB of RAM with no contention from other jobs, and this PG session was allowed to use much of it as work memory. But even if it was spilling to disk at every step... that should have been fine. The CSV equivalent of this inherently "spills to disk", for everything except the nursery levels of sort(1)'s merge-sort. And it does fine.)
cryptonector · 5m ago
> At the data scale + level of complexity our OLAP queries operate at, we very often run into situations where Postgres's very best plan [with a well-considered schema, with great indexes and statistics, and after tons of tuning and coaxing], still does something literally interminable — not for any semantic reason to do with the query plan, but rather due to how Postgres's architecture executes the query plan[1].
Well, ok, this is a problem, and I have run into it myself. That's not a reason for not wanting a QL; it's a reason for wanting a way to improve the query planning. Query hints in the QL are a bad idea for several reasons. What I would like instead is out-of-band query hints that I can provide along with my query (though obviously only when using APIs rather than `psql`; for `psql` one would have to provide the hints via some \hints command). I would address each table source using names/aliases for the table sources / joins and names for subqueries -- really a path through the query and its subqueries, like `.<sub_query_alias0>.<sub_query_alias1>.<..>.<sub_query_aliasN>.<table_source_alias>` -- and the hint would indicate things like what sub-query plan type to use and what index to use.
truetraveller · 1h ago
diagrams made how?
hobs · 57m ago
Guessing Figma.
philippemnoel · 46m ago
Yes, Figma!
moezd · 2h ago
Builds an LSM tree for Postgres that is based on B-trees and vacuuming, gets surprised when a real workload arrives with too many deletes/updates, and watches as LSM compaction and vacuuming butt heads with each other. At that moment, hot_standby_feedback is more like "pls pls let us finish the benchmarks".
Guess we will live to relearn that the CAP theorem is a physical limit and not just a theory.
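(For context, that's the standby-side knob that asks the primary to hold off vacuuming rows a replica's queries still need; a minimal sketch of enabling it, run on the standby:)

  -- On the standby (or set it in its postgresql.conf): ask the primary not to
  -- vacuum away rows this replica's long-running queries still need.
  ALTER SYSTEM SET hot_standby_feedback = on;
  SELECT pg_reload_conf();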
Hard to blame the CAP theorem since this is a problem across an interface layer. If the DB knew about the data it could manage the LSM tree without issue.