Launch HN: ParaQuery (YC X25) – GPU Accelerated Spark/SQL

69 points by winwang | 36 comments | 5/12/2025, 4:01:31 PM
Hey HN! I'm Win, founder of ParaQuery (https://paraquery.com), a fully-managed, GPU-accelerated Spark + SQL solution. We deliver BigQuery's ease of use (or easier) while being significantly more cost-efficient and performant.

Here's a short demo video demonstrating ParaQuery (vs. BigQuery) on a simple ETL job: https://www.youtube.com/watch?v=uu379YnccGU

It's well known, at least among researchers and GPU companies like NVIDIA, that GPUs are very good for many SQL and dataframe tasks. So much so that, in 2018, NVIDIA launched the RAPIDS program and the Spark-RAPIDS plugin (https://github.com/NVIDIA/spark-rapids). I actually found out because, at the time, I was trying to craft a CUDA-based lambda calculus interpreter…one of several ideas I didn't manage to implement, haha.
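
For the curious, wiring the plugin into an existing Spark 3.x job is mostly configuration. Here's a minimal sketch (not our exact setup), assuming the rapids-4-spark jar is already on the classpath and each executor has a GPU scheduled to it:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch (not ParaQuery's setup): enable Spark-RAPIDS on a stock Spark 3.x job.
    // Assumes the rapids-4-spark jar is on the driver/executor classpath and
    // each executor has a GPU scheduled to it.
    val spark = SparkSession.builder()
      .appName("gpu-etl-sketch")
      .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  // load the RAPIDS SQL plugin
      .config("spark.rapids.sql.enabled", "true")             // run supported operators on the GPU
      .config("spark.executor.resource.gpu.amount", "1")      // one GPU per executor
      .config("spark.task.resource.gpu.amount", "0.25")       // let 4 tasks share that GPU
      .getOrCreate()

    // From here, normal Spark SQL / DataFrame code runs unchanged; supported
    // operators execute on the GPU and the rest fall back to the CPU.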

There seems to be a perception among at least some engineers that GPUs are only good for AI, graphics, and maybe image processing (maybe! someone actually told me they thought GPUs are bad for image processing!). Traditional data processing doesn't come to mind. But GPUs are actually good for this as well!

At a high level, big data processing is a high-throughput, massively parallel workload. GPUs are a type of hardware specialized for this, are highly programmable, and (now) happen to be highly available on the cloud! Even better, GPU memory is tuned for bandwidth over raw latency, which only improves their throughput capabilities compared to a CPU. And by just playing with cloud cost calculators for a couple of minutes, it's clear that GPUs are cost-effective even on the major clouds.

To be honest, I thought using GPUs for SQL processing would have taken off by now, but it hasn't. So, just over a year ago, I started working on actually deploying a cloud-based data platform powered by GPUs (i.e. Spark-RAPIDS), spurred by a friend-of-a-friend(-of-a-friend) who happened to have BigQuery cost concerns at his startup. After getting a proof of concept done and a letter of intent... well, nothing happened! Even after over half a year. But then, something magical did happen: their cloud credits ran out!

And now, they're saving over 60% on their BigQuery bill by using ParaQuery, while also being 2x faster -- with zero data migration needed (courtesy of Spark's GCS connector). By the way, I'm not sure about other people's experiences, but... we're pretty far from being I/O-bound (to the surprise of many engineers I've spoken to).
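
To illustrate the zero-migration point, here's a hypothetical sketch (made-up bucket name) of Spark reading the same GCS data in place, assuming the GCS connector is on the classpath:

    import org.apache.spark.sql.SparkSession

    // Hypothetical sketch: read existing Parquet data straight out of GCS,
    // aggregate it, and write the result back -- no copy into a separate warehouse.
    val spark = SparkSession.builder().appName("gcs-etl-sketch").getOrCreate()

    val events = spark.read.parquet("gs://example-bucket/events/")  // made-up path

    val counts = events.groupBy("user_id").count()

    counts.write.mode("overwrite").parquet("gs://example-bucket/output/user_counts/")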

I think that the future of high-throughput compute is computing on high-throughput hardware. If you think so too, or you have scaling data challenges, you can sign up here: https://paraquery.com/waitlist. Sorry for the waitlist, but we're not ready for a self-serve experience just yet—it would front-load significant engineering and hardware cost. But we’ll get there, so stay tuned!

Thanks for reading! What have your experiences been with huge ETL / processing loads? Was cost or performance an issue? And what do you think about GPU acceleration (GPGPU)? Did you think GPUs were simply expensive? Would love to just talk about tech here!

Comments (36)

latchkey · 21m ago
Great, let me know if you want a 1x AMD MI300x VM to build/test on. Free.

https://x.com/HotAisle/status/1921983426972025023

winwang · 1m ago
Nice! I attended a hackathon by Modular last weekend where we got to play with MI300X (sponsored by AMD and Crusoe). My team made a GPU-"accelerated" BM25 in Mojo, but mostly kind of failed at it, haha.

The software stack for AMD is still a bit too nascent for ParaQuery's Spark engine, but certain realtime/online workloads can definitely be programmed pretty fast. They also happen to benefit greatly from the staggering levels of HBM on AMD chips. Hopefully I can take a mini-vacation later in the summer to hack on your GPUs :)

mritchie712 · 1h ago
> they're saving over 60% off of their BigQuery bill

how big is their data?

A lot of BigQuery users would be surprised to find they don't need BigQuery.

This[0] post (written by a founding engineer of BigQuery) has a bit of hyperbole, but this part is in line with my experience:

> A couple of years ago I did an analysis of BigQuery queries, looking at customers spending more than $1000 / year. 90% of queries processed less than 100 MB of data. I sliced this a number of different ways to make sure it wasn’t just a couple of customers who ran a ton of queries skewing the results. I also cut out metadata-only queries, which are a small subset of queries in BigQuery that don’t need to read any data at all. You have to go pretty high on the percentile range until you get into the gigabytes, and there are very few queries that run in the terabyte range.

We're[1] built on duckdb and I couldn't be happier about it. Insanely easy to get started with, runs locally and client-side in WASM, great language features.

0 - https://motherduck.com/blog/big-data-is-dead/

1 - https://www.definite.app/

winwang · 52m ago
They have >1PB of data to ETL, with some queries hitting 450TB of pure shuffle.

It's very true that most users don't need something like BigQuery or Snowflake. That's why some startups have sprung up to cut Snowflake costs by "simply" putting a Postgres instance in front of it!

In fact, I just advised someone recently to simply use Postgres instead of BigQuery since they had <1TB and their queries weren't super intensive.

vpamrit2 · 5m ago
Very cool and great job! Would love to see more as you progress on your journey!
dogman123 · 1h ago
This could be incredibly useful for me. I'm currently struggling to complete jobs with massive amounts of shuffle with Spark on EMR (large joins yielding 150+ billion rows). We use Glue, but it has become cost-prohibitive.
winwang · 1h ago
Is the shuffle the biggest issue? Not too sure about joins but one of the datasets we're currently dealing with has a couple trillion rows. Would love to chat about this!
Boxxed · 48m ago
I'm surprised the GPU is a win when the data is coming from GCS. The CPU still has to touch all the data, right? Or do you have some mechanism to keep warm data live in the GPUs?
winwang · 43m ago
Yep, the CPU has to transfer the data because there's no RDMA setup on GCP, lol. But that's like 16-32 GB/s of transfer per GPU (assuming T4/L4 nodes), which is much more than network bandwidth. And we're not even network-bound, even if there's no warm data (i.e. for our ETL workloads). However, there is some stuff kept on the GPU during actual execution for each Spark task, even if they aren't running on the GPU at the moment, which makes handling memory and partition sizes... "fun", haha.
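
To give a flavor of the knobs involved, here are a few of the documented spark-rapids/Spark settings (illustrative values, not our production config):

    // Illustrative values only -- these are public spark-rapids / Spark configs,
    // not ParaQuery's production settings.
    val gpuMemoryKnobs = Map(
      "spark.rapids.memory.pinnedPool.size" -> "8g",   // pinned host memory speeds up CPU<->GPU copies
      "spark.rapids.sql.concurrentGpuTasks" -> "2",    // how many tasks share the GPU at once
      "spark.sql.files.maxPartitionBytes"   -> "512m", // larger input partitions keep the GPU fed
      "spark.sql.shuffle.partitions"        -> "2000"  // shuffle partition sizing still matters a lot
    )
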
billy1kaplan · 1h ago
Congrats on the launch, Win! I remember seeing your prototype a while ago and it's awesome to see your progress!
achennupati · 20m ago
This is super cool stuff, and would've been really interesting to apply to the Spark work I had to do for Amazon's search system! How is this different from something like using spark-rapids on AWS EMR with GPU-enabled EC2 instances? Are you building on top of spark-rapids, or is this a more custom solution?
random17 · 3h ago
Congrats on the launch!

I'm curious about what kinds of workloads you see GPU-accelerated compute have a significant impact on, and what kinds still pose challenges. You mentioned that I/O is not the bottleneck; is that still true for queries that require large-scale shuffles?

winwang · 3h ago
Large-scale shuffles: Absolutely. One of the larger queries we ran saw a 450TB shuffle -- this may require more than just deploying the spark-rapids plugin, however (depends on the query itself and the specific VMs used). Shuffling took the majority of the time and saw 100% (...99%?) GPU utilization. I presume this is partially due to compressing shuffle partitions. Network/disk I/O is definitely not the bottleneck here.

It's difficult to say what "workloads" are significant, and easier to talk about what doesn't really work AFAIK. Large-scale shuffles might see 4x efficiency, assuming you can somehow offload the hash shuffle memory, have scalable fast storage, etc... which we do. Note this is even on GCP, where there isn't any "great" networking infra available.

Things that don't get accelerated include multi-column UDFs and some incompatible operations. These aren't physical/logical limitations, it's just where the software is right now: https://github.com/NVIDIA/spark-rapids/issues

Multi-column UDF support would likely require some compiler-esque work in Scala (which I happen to have experience in).
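
If you're curious what falls back for your own queries, here's a rough sketch (assuming an existing SparkSession with the plugin enabled):

    import org.apache.spark.sql.functions.{col, udf}

    // Rough sketch, assuming an existing SparkSession `spark` with the RAPIDS plugin enabled.
    // spark.rapids.sql.explain=NOT_ON_GPU logs only the operators that fall back, with reasons.
    spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")

    // Hypothetical multi-column UDF -- per the above, this expression stays on the CPU today.
    val combine = udf((a: String, b: String) => a + ":" + b)
    spark.range(10)
      .selectExpr("cast(id as string) as a", "cast(id * 2 as string) as b")
      .select(combine(col("a"), col("b")))
      .explain()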

A few things I expect to be "very" good: joins, string aggregations (empirically), sorting (clustering). Operations which stress memory bandwidth will likely be "surprisingly" good (surprising to most people).

Otherwise, NVIDIA (along with some other public companies) has published a bunch of really good-looking public data.

Outside of Spark, I think many people underestimate how "low-latency" GPUs can be. 100 microseconds and above is highly likely to be a good fit for GPU acceleration in general, though that could be as low as 10 microseconds (today).

_zoltan_ · 1h ago
8TB/s bandwidth on the B200 helps :-) (yes, yes, that is at the high end, but 4.8TB/s@H200, 4TB/s@H100, 2TB/s@A100 is nothing to sneeze at either).
winwang · 1h ago
Very true. Can't get those numbers even if you get an entire single-tenant CPU VM. Minor note, A100 40G is 1.5TB/s (and much easier to obtain).

That being said, ParaQuery mainly uses T4 and L4 GPUs with "just" ~300 GB/s of bandwidth. I believe (correct me if I'm wrong) that should be around the memory bandwidth of a 64-core VM, though obviously dependent on the actual VM family.
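
Back-of-envelope on that comparison, with assumed DDR4 figures (not a benchmark):

    // Rough arithmetic with assumed figures: DDR4-3200 moves 8 bytes per transfer per channel.
    val perChannelGBps = 3.2e9 * 8 / 1e9    // ~25.6 GB/s per channel
    val cpuSocketGBps  = perChannelGBps * 8 // ~205 GB/s for an 8-channel socket
    val t4OrL4GBps     = 300.0              // ~300 GB/s of GDDR6 on a T4/L4
    println(f"CPU socket ~ $cpuSocketGBps%.0f GB/s vs T4/L4 ~ $t4OrL4GBps%.0f GB/s")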

_zoltan_ · 1h ago
A couple of weeks ago at VeloxCon, one of the days was dedicated to GPU processing (the other to AI/ML data preprocessing); the cuDF team talked about their Velox integration as well. For those interested, it might be worth checking out.

Disclaimer: my team is working on this very problem as well, and I was a speaker at VeloxCon.

winwang · 1h ago
Just checked out Velox. It's awesome that you're reducing duplicate eng effort! What was your talk about?
_zoltan_ · 1h ago
I was part of the panel discussion at the end of the 2nd day, discussing hardware acceleration for query processing. Before the panel there were very interesting talks about various approaches to the end goal, which is to efficiently use hardware (non-x86 CPUs, so mostly either GPUs or FPGAs/custom chips) to speed up queries.
winwang · 1h ago
Since you mentioned non-x86, how are things on the ARM side? I believe I heard AWS's Graviton + Corretto combo was a huge increase in JVM efficiency.

FPGAs... I somehow highly doubt their efficiency in terms of being the "core" (heh) processor. However, "compute storage" with FPGAs right next to the flash is really interesting.

dbacar · 1h ago
Cool project, congratulations.

How would you contrast it against HeavyDB?

https://github.com/heavyai/heavydb

winwang · 1h ago
I'm not too familiar with HeavyDB, but here are the main differences:

- We're fully compatible with Spark SQL (and Spark). Meaning little to no migration overhead.

- Our focus is on distributed compute first.

- That means ParaQuery isn't a database, just an MPP engine (for now). Also means no data ingestion/migration needed.

jelder · 3h ago
Any relationship with the PG-Strom project?

http://heterodb.github.io/pg-strom/

winwang · 3h ago
No relationship... yet! Hoping to have a good relationship in the future so I have a business reason to fly to Japan :D

Btw, interesting thing they said here: "By utilization of GPU (Graphic Processor Unit) device which has thousands cores per chip"

It's more like "hundreds", since the number of "real" cores is like (CUDA cores / 32). Though I think we're about to see 1k cores (SMSPs).

That being said, I do believe CUDA cores have more interesting capabilities than a typical vector lane, i.e. for memory operations (thank the compiler). Would love to be corrected!

justinl33 · 3h ago
So nice to see GPUs being used for classical reasons again.
winwang · 2h ago
Not sure if Spark is a classical use case for GPU compute ;) Well, outside of HPC and research.

SQL on GPUs is definitely a research classic, dating back to 2004 at least: https://gamma.cs.unc.edu/DB/

zackmorris · 1h ago
Came here to say the same thing.

Set Theory is the classical foundation of SQL:

https://www.sqlshack.com/mathematics-sql-server-fast-introdu...

It's analogous to how functional programming expressed through languages like lisp is the classical foundation of spreadsheets.

I believe that skipping first principles (sort of like premature optimization) is the root of all evil. Some other examples:

- If TCP had been a layer above UDP instead of its own protocol beside it, we would have had real peer to peer networking this whole time instead of needing WebRTC.

- If we had a common serial communication standard analogous to TCP for sockets, then we wouldn't need different serial ports like USB, Thunderbolt and HDMI.

- If we hid the web browser's progress bar and used server-side rendering with forms, we could implement the rich interfaces of single-page applications with vastly reduced complexity by keeping the state, logic and validation in one place with no perceptible change for the average user.

- If there was a common scripting language bundled into all operating systems, then we could publish native apps as scripts with substantially less code and not have to choose between web and mobile for example.

- If we had highly multicore CPUs with hundreds or thousands of cores, then multiprocessing, 3D graphics and AI frameworks could be written as libraries running on them instead of requiring separate GPUs.

And it's not just tech. The automotive industry lacks standard chassis types and even OEM parts. We can't buy Stirling engines or Tesla turbines off the shelf. CIGS solar panels, E-ink displays, standardized removable batteries, thermal printers for ordinary paper, heck even "close enough" contact lenses: where are these products?

We make a lot of excuses for why the economy is bad, but just look at how much time and effort we waste by having to use cookie cutter solutions instead of having access to the underlying parts and resources we need. I don't think that everyone is suddenly becoming neurodivergent from vaccines or some other scapegoat, I think it's just become so obvious that the whole world is broken and rigged to work us all to the grave to make some guy rich that it's giving all of us ADHD symptoms from having to cope with it.

winwang · 56m ago
I'm not sure about the rest of your comment, but we would likely still want GPUs even with highly multicore CPUs. Case in point: the upper-range Threadripper series.

It makes sense to have two specialized systems: a low-latency system, and a high-throughput system, as it's a real tradeoff. Most people/apps need low-latency.

As for throughput and efficiency... turns out that shaving off lots of circuitry allows you to power less circuitry! GPUs have a lot of sharing going on and not a lot of "smarts". That doesn't even touch on their integrated throughput optimized DRAM (VRAM/HBM). So... not quite. We'd still be gaming on GPUs :)

_zoltan_ · 1h ago
Nobody has mentioned Apache Gluten so far. Are you familiar with it? How do you compare?
winwang · 1h ago
I was about to comment that Gluten is only targeting CPU vectorization, but then I found this (very cool!): https://github.com/apache/incubator-gluten/issues/9098

I'm not very familiar with Gluten, but I'll comment on the CPU side anyway, assuming that one of Gluten's goals is to use the full vector processing (SIMD) potential of the CPU. In that case, we'd still be memory(-bandwidth)-bound, not to mention the significantly lower FLOPs of the CPU itself. If we vectorize Spark (or any MPP) for efficient compute, perhaps we should run it on hardware optimized for vectorized, super-parallel, high-throughput compute.

Also, there's nothing which says we can't use Gluten to have even more CPU+GPU utilization!

modelorona · 3h ago
Awesome to see GPUs being used for something other than crypto (are they still?) and AI.

How is it priced? I couldn't see anything on the site.

latchkey · 20m ago
After ETH moved to PoS, GPUs are no longer used for wide-scale crypto mining.
winwang · 3h ago
Still figuring out pricing! For our first customers, we're pricing by either bytes scanned or compute time, similar to BigQuery. We're also experimenting with a contract that gives the minimum of the two potential charges (up to a sustainable limit).

However, for deployments to the customer's cloud, it would be a stereotypical enterprise license + support.

Can't wait to actually add an FAQ to the site, hopefully based on the questions asked here. Pricing is one of the things preventing me from just allowing self-serve, since it has to be stable, sustainable, and cheap!

Also, with the GPU clouds, pricing would have to be different per cloud, though I guess I can worry about that later. Would be crazy cheap(er) to process on them.

As far as I know, GPUs are definitely still being used in crypto/web3... and AI for that matter :P

bbrw · 54m ago
Congrats on the launch, Win! Thrilled to see your product launch after following your journey from the start.
spencerbrown · 4h ago
I'm super excited about this. I saw an early demo and it's epic. Congrats on the launch, Win!
winwang · 3h ago
Thanks! I was also told to make a performance-focused demo... didn't do it in time, but was able to go from a 44-minute BigQuery job to a 5.5-minute ParaQuery job, with a similar dataset/query as the video here.

8x faster!

gpu3546 · 2h ago
very nice