Show HN: Hydra (YC W22) – Serverless Analytics on Postgres
Traditionally, this was infeasible: Postgres is a rowstore database that’s 1000X slower at analytical processing than a columnstore database.
(A quick refresher for anyone interested: a rowstore means table rows are stored sequentially, making it efficient to insert or update a record, but inefficient to filter and aggregate data. At most businesses, analytical reporting scans large volumes of events, traces, and time-series data. As the volume grows, the inefficiency of the rowstore compounds: it doesn't scale for analytics. In contrast, a columnstore stores all the values of each column in sequence, so an analytical query only has to read the columns it actually touches.)
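To make that concrete, here's a hypothetical analytical query over an events table (the table and column names are invented for illustration). It only needs two columns, which is exactly the access pattern a columnstore serves well and a rowstore can only answer by scanning every full row:

    -- Hypothetical events table; the query reads only occurred_at and user_id,
    -- yet a rowstore must scan entire rows (every column) to answer it.
    SELECT date_trunc('day', occurred_at) AS day,
           count(DISTINCT user_id)        AS daily_active_users
    FROM   events
    GROUP  BY 1
    ORDER  BY 1;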
For decades, businesses had to manage the relative strengths of the rowstore and the columnstore by maintaining two separate systems. This led to large gaps in functionality, in syntax, and in the background knowledge required of engineers. For example, here are the gaps between Redshift (a popular columnstore) and Postgres (rowstore) features: (https://docs.aws.amazon.com/redshift/latest/dg/c_unsupported...).
We think there’s a better, simpler way: unify the rowstore and columnstore – keep the data in one place and drop the cost and hassle of managing an external analytics database. With Hydra, events, traces, time-series data, user sessions, clickstream, IoT telemetry, etc. are accessible as a columnstore right alongside your standard rowstore tables.
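As a minimal sketch of what this looks like (assuming Hydra exposes its columnar table access method as USING columnar, the syntax used by Hydra's open-source columnar extension), a columnstore table can live right next to a regular heap table:

    -- Regular rowstore (heap) table for transactional data.
    CREATE TABLE users (
        id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        email text NOT NULL,
        plan  text
    );

    -- Columnstore table for high-volume event data.
    CREATE TABLE events (
        user_id     bigint,
        event_type  text,
        occurred_at timestamptz
    ) USING columnar;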
Our solution: Hydra separates compute from storage to bring a columnstore, serverless query processing, and automatic caching to your Postgres database.
The term "serverless" can be a bit confusing, because a server always exists; it means compute is ephemeral and is spun up and down automatically. The database automatically provisions and isolates dedicated compute resources for each query process. Serverless is different from managed compute, where the user explicitly allocates and scales CPU and memory, keeps them running continuously, and potentially overpays during idle time.
How is serverless useful? It's important that every analytics query gets its own resources per process. The major hurdles to running analytics on Postgres are 1) rowstore performance and 2) resource contention. #2 is very often overlooked, but in practice, analytics queries tend to hog resources (RAM and CPU) from Postgres transactional work. So a slightly expensive analytics query can slow down the entire database. That's why serverless matters: it guarantees that expensive queries are isolated and run on dedicated database resources per process.
Why is Hydra so fast at analytics? (https://tinyurl.com/hydraDBMS) 1) Columnstore by default, 2) metadata for efficient file-skipping and retrieval, 3) parallel, vectorized execution, 4) automatic caching.
What’s the killer feature? Hydra can quickly join columnstore tables with standard row tables inside Postgres using plain SQL.
Example: “Segment events as a table.” Instead of dumping Segment event data into an S3 bucket or an external analytics database, use Hydra to store events (clicks, signups, purchases) and join them with user profile data inside Postgres. Know your users in real time: “What events predict churn?” or “Which users are likely to convert?” becomes immediately actionable (see the sketch below).
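For instance, a churn-flavored question over the hypothetical events and users tables above could be a single join (the event names and columns are invented for the example):

    -- Join columnstore events with rowstore user profiles in one query.
    SELECT u.plan,
           count(*) FILTER (WHERE e.event_type = 'cancel_click') AS cancel_clicks,
           count(*) FILTER (WHERE e.event_type = 'purchase')     AS purchases
    FROM   events e
    JOIN   users u ON u.id = e.user_id
    WHERE  e.occurred_at > now() - interval '30 days'
    GROUP  BY u.plan;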
Thanks for reading! We'd love to hear your feedback, and if you'd like to try Hydra now, we offer a $300 credit and 14 days free per account. We're excited to see how bringing the columnstore and rowstore side by side can help your project.
> Visit http://platform.hydra.so/token to fetch the access token and paste it into the section above.
1- What data is shared with Hydra in this case?
2- What's the pricing for the bare-metal deployment?
Having followed the project for a while now, I really scratch my head when looking at your pricing.
The entire innovation of the past decade in database land has gone towards decoupling storage and compute, driving query engines (like DuckDB) and file formats (like Iceberg).
Yet you force-bundle storage and compute in your pricing while also selling a serverless product.
What's the reason behind that?
Why do it in the first place?
How does your pricing work?
Are the 40/500 compute hours I get included in the spend limit per tier (i.e. max 160 additional hours in Starter, etc.), or are they completely separate?
Why are there member constraints on a database product?
How does that factor into cost / map to SDL / reasonable team setups for people operating analytics projects revolving around a database like yours?
I have never seen such a limit with any other vendor, and especially when you want to get a foothold in the market / have people start using Hydra for the specialized role it can provide, a 2-person limit on the minimum tier would likely be a showstopper if I wanted to PoC this, tbh...
One of the downsides of serverless is that it can be difficult to predict the overall monthly cost when the granularity of billing (per invocation, memory usage, or execution time) is complex. For developers this might be totally fine (even preferred), but we think a single, predictable price (Hydra at $100/month) is easier for businesses to plan around.
Usage caps per plan are purely soft limits, so users don't actually encounter them. Yes, we want people to upgrade to higher plans. In the words of Maya Angelou, "Be careful when a naked person offers you a shirt" - meaning we believe these are the best prices we can offer today while building a sustainable project. That said, I appreciate your point about our limit on the number of users. If we removed that limit, would you try out Hydra?
We currently use AWS Aurora. How easy would it be to simply SQL-dump and load into Hydra, and how well would it serve as a drop-in replacement?
We initially set the rowstore as the default, but people wouldn't create columnstore tables and were confused about why performance wasn't improving. So we figured this was cleaner, but you always have the option to switch the default table type back.
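As a rough sketch of what switching the default back could look like (assuming Hydra drives this through the standard Postgres default_table_access_method setting, with "columnar" as the name of its access method; "heap" is Postgres's built-in rowstore and mydb is a placeholder database name):

    -- Check which table type new tables get by default.
    SHOW default_table_access_method;

    -- Make new tables rowstore (heap) again for the current session...
    SET default_table_access_method = 'heap';

    -- ...or persist the change for a whole database.
    ALTER DATABASE mydb SET default_table_access_method = 'heap';

    -- Individual tables can still opt into the columnstore explicitly.
    CREATE TABLE events_archive (LIKE events) USING columnar;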