Polars Cloud and Distributed Polars now available

150 points by jonbaer | 81 comments | 9/4/2025, 3:01:49 AM | pola.rs ↗

Comments (81)

drej · 12h ago
Having done a bit of data engineering in my day, I'm growing more and more allergic to the DataFrame API (which I used 24/7 for years). From what I've seen over the past ~10 years, 90+% of use cases would be better served by SQL, from the development perspective as well as for debugging, onboarding, sharing, migrating, etc.

Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what m6.xlarge is and how it's different from c6g.large.

mrtimo · 9h ago
I agree with this 100%. The creator of DuckDB argues that people using pandas are missing out on the 50 years of progress in database research, in the first 5 minutes of his talk here [1].

I've been using Malloy [2], which compiles to SQL (like TypeScript compiles to JavaScript), so instead of editing a 1000-line SQL script, it's only 18 lines of Malloy.

I'd love to see a blog post comparing a pandas approach to data cleaning with an SQL/Malloy approach.

[1] https://www.youtube.com/watch?v=PFUZlNQIndo [2] https://www.malloydata.dev/

orlp · 7h ago
> The creator of DuckDB argues that people using pandas are missing out on the 50 years of progress in database research, in the first 5 minutes of his talk here.

That's pandas. Polars builds on much of the same 50 years of progress in database research by offering a lazy DataFrame API which does query optimization, morsel-based columnar execution, predicate pushdown into file I/O, etc, etc.

Disclaimer: I work for Polars on said query execution.
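
For a concrete sense of what that looks like, here is a minimal sketch (the file path and column names are hypothetical) of the lazy pattern described above, where the filter can be pushed down into the scan instead of being applied after reading everything:

```python
import polars as pl

lf = (
    pl.scan_parquet("events.parquet")     # lazy: nothing is read yet
    .filter(pl.col("country") == "NL")    # candidate for predicate pushdown into the scan
    .group_by("user_id")
    .agg(pl.col("amount").sum())
)

print(lf.explain())  # shows the optimized plan, with the filter pushed into the parquet scan
df = lf.collect()    # execution only happens here
```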

phailhaus · 1h ago
The DataFrame interface itself is the problem. It's incredibly hard to read, write, debug, and test. Too much work has gone into reducing keystrokes rather than developing a better tool.
dev_l1x_be · 59m ago
Not sure what you mean by this. The table concept is as old as computers. Here is a table, do something with it -> that is the high-level df API. All the functions make sense; what is hard to read, write, or debug here?

I have used Polars to process 600M of XML files (with a bit of a hack) and the Polars part of the code is readable with minimal comments.

Polars has a better API than pandas; at least the intent is easier to understand. (Laziness, yay.)

phailhaus · 30s ago
The problem with the dataframe API is that whenever you want to change a small part of your logic, you usually have to rewrite your solution. It is too difficult to write reusable code. Too many functions that try to do everything with a million kwargs that each have their own nuances.
entropicdrifter · 4h ago
Just wanted to say I'm a huge fan of your work. Been using Polars for my team's main project for years and it just keeps getting better.
fumeux_fume · 7h ago
In the same talk, Mark acknowledges that "for data science workflows, database systems are frustrating and slow." Granted, DuckDB is an attempt to fix that, but most data scientists don't get to choose what database the data is stored in.
willvarfar · 7h ago
(I use duckdb to query data stored in parquet files)
mrtimo · 5h ago
Same. But I use Malloy, which uses DuckDB to query data stored in hundreds of parquet files (as if they were one big file).
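
For reference, a minimal sketch of that pattern (the glob path is hypothetical), using DuckDB from Python to treat many parquet files as one table:

```python
import duckdb

# DuckDB expands the glob and scans all matching parquet files as a single relation.
df = duckdb.sql("SELECT count(*) AS rows FROM 'data/*.parquet'").df()
```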
esafak · 8h ago
Have you used Malloy in a pipeline, e.g., with Airflow? If so, how was the experience?
fumeux_fume · 7h ago
We all have allergies. I'm allergic to 1000-line SQL queries that include functions only usable in a specific flavor and version of SQL.
robertkoss · 12h ago
That is a false dichotomy. You can use SQL tools but still have to choose the instance type.

Especially when considering testability and composability, using a DataFrame API inside regular languages like Python is far superior IMO.

gigatexal · 10h ago
Yeah it makes no sense.

Why is the dataframe approach getting hate when you’re talking about runtime details?

Sure, folks find SQL almost conversational compared to the dataframe API, but the other points make no difference.

If you're a competent dev/data person and are productive with the dataframe, then yay. As for setup and creating test data and such, it's all objects and functions after all — if anything it's better than the horribad experience of ORMs.

drej · 11h ago
As a user? No, I don't have to choose. What I'm saying is that analysts (whom Polars Cloud targets, just like Coiled or Databricks) shouldn't have to worry about instance types, shuffling performance, join strategies, JVM versions, cross-AZ pricing, etc. In most cases, they should just get a connection string and/or a web UI to run their queries, with everything else abstracted away from them.

Sure, Python code is more testable and composable (and I do love that). Have I seen _any_ analysts write tests or compose their queries? I'm not saying these people don't exist, but I have yet to bump into any.

robertkoss · 11h ago
You were talking about data engineering. If you do not write tests as a data engineer, what are you doing then? Just hoping that you don't fuck up editing a 1000-line SQL script?

If you use Athena you still have to worry about shuffling and joining; it is just hidden. It is Trino/Presto under the hood, and if you click Explain you can see the execution plan, which is essentially the same as looking into the Spark UI.

Who cares about JVM versions nowadays? No one is hosting Spark themselves.

Literally every tool now supports DataFrame AND SQL APIs and to me there is no reason to pick up SQL if you are familiar with a little bit of Python

datadrivenangel · 8h ago
Way too many data engineers are running in clown mode just eyeballing the results of 1000 line SQL scripts....

https://ludic.mataroa.blog/blog/get-me-out-of-data-hell/

drej · 11h ago
I was talking about data engineering, because that was my job and all analysts were downstream of me. And I could see them struggle with handling infrastructure and way too many toggles that our platform provided them (Databricks at the time).

Yes, I did write tests and no, I did not write 1000-line SQL (or any SQL for that matter). But I could see analysts struggle, and I could see other people in other orgs just firing off simple SQL queries that did the same as the non-portable Python mess that we had to keep alive. (Not to mention the far superior performance of database queries.)

But I knew how this all came to be: a manager wanted to pad their resume with some big data acronyms, and as a result we spent way too much time and money migrating to an architecture that made everyone worse off.

ritchie46 · 11h ago
With Polars Cloud you don't have to choose those either. You can pick CPU/memory, and we will offer autoscaling in a few months.

Cluster configuration is optional, for those who do want that control. Anyhow, this doesn't have much to do with the query API, be it SQL or DataFrame.

ayhanfuat · 11h ago
I really doubt that Polars Cloud targets analysts doing ad-hoc analyses. It is much more likely aimed at people who build data pipelines for downstream tasks (ML, etc.).
ritchie46 · 11h ago
We also target ad-hoc analysis. If your data doesn't fit on your laptop, you can spin up a larger box or a cluster and run interactive queries.
gigatexal · 10h ago
Again, the issue you're having is the skill level of the audience you keep bringing up, not the tool.
drej · 10h ago
I find it much more beneficial to lower the barrier for entry (oftentimes without any sacrifices) instead of spending time and money on upskilling everyone, just because I like engineering.
gigatexal · 8h ago
Right, but nobody is saying Polars or dataframes are meant to replace SQL, or are even for the masses. It's a tool for skilled folks. I personally think the API makes sense, but SQL is easier to pick up. Use whatever tools work best.

But coming into such a discussion dunking on a tool cuz it’s not for the masses makes no sense.

drej · 8h ago
Read my posts again, I'm not complaining it's not for the masses, I know it isn't. I'm complaining that it's being forced upon people when there are simpler alternatives that help people focus on business problems rather than setting up virtual environments.

So I'm very much advocating for people to "[u]se whatever tools work best".

(That is, now I'm doing this. In the past I taught a course on pandas data analytics and spoke at a few PyData conferences and meetups, partly about dataframes and how useful they are. So I'm very much guilty of all of the above.)

gigatexal · 7h ago
Who is doing the forcing? In my decade as a data engineer I've not found a place that forced dataframes on would-be and capable SQL analysts.
mr_toad · 10h ago
Analysts don’t because it’s not part of the training & culture. If you’re writing tests you’re doing engineering.

That said the last Python code I wrote as a data engineer was to run tests on an SQL database, because the equivalent in SQL would have been tens of thousands of lines of wallpaper code.
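
A rough sketch of the kind of check meant here (the table names, connection, and invariant are all hypothetical, with DuckDB standing in for the actual database): one parametrized Python test replaces a hand-written SQL check per table.

```python
import duckdb
import pytest

TABLES = ["orders", "customers", "payments"]  # hypothetical table list
con = duckdb.connect("warehouse.duckdb")      # hypothetical database file

@pytest.mark.parametrize("table", TABLES)
def test_no_null_primary_keys(table):
    # The same assertion runs against every table without duplicating SQL.
    nulls = con.execute(f"SELECT count(*) FROM {table} WHERE id IS NULL").fetchone()[0]
    assert nulls == 0
```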

riku_iki · 1h ago
> analysts (who this Polars Cloud targets, just like Coiled or Databricks) shouldn't worry about instance types, shuffling performance, join strategies,

I think this part (query optimization) is in general not solved/solvable, and it is sometimes/often (depending on the domain) necessary to dig into the details to make a data transformation work.

drej · 11h ago
Fun aside: I actually used Polars for a bit. The first time I tried it, I thought it was broken, because it finished processing so quickly that I assumed it had silently exited or something.

So I'm definitely a fan, IF you need the DataFrame API. My point was that most people don't need it and it's oftentimes standing in the way. That's all.

orochimaaru · 7h ago
Polars is very nice. I've used it off and on. The option to write Rust UDFs for performance, and the easy integration of Rust with Python via PyO3, will make it a real contender.

Yes, I know Spark and Scala exist. I use them. But the underlying Java engines and the tacky Python gateway impact performance and capacity usage. Having your primary processing engine natively compiled and in the same process always helps.

spenczar5 · 7h ago
I agree, but there are other possibilities in between those two extremes, like Quivr [1]. Schemas are good, but they can be defined in Python and you get a lot more composability and modularity than you would find in SQL (or pandas, realistically).

1: https://github.com/B612-Asteroid-Institute/quivr

RobinL · 8h ago
100% agree. I've also worked as a data engineer and came to the same conclusion. I wrote up a blog post that goes into a bit more depth on the topic here: https://www.robinlinacre.com/recommend_sql/
sureglymop · 9h ago
I recently had to create a reproducible version of incredibly complicated and messy R concoctions our data scientists came up with.

I did it with pandas without much experience with it and a lot of AI help (essentially to fill in the blanks the data scientists had left, because they only had to do the calculation once).

I then created a Polars version which uses LazyFrames. It ended up being about 20x faster than the first version. I did try to do some optimizations by hand to make the execution planner work even better, which I believe paid off.

If you have to do a large non-interactive analytical calculation (i.e. not in a notebook), Polars seems to be way ahead IMO!
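
A rough sketch of the lazy pattern being described (the file name and columns are hypothetical): the whole pipeline is declared up front, so the query optimizer sees the full plan before anything runs.

```python
import polars as pl

result = (
    pl.scan_csv("measurements.csv")  # lazy scan; nothing is loaded yet
    .with_columns((pl.col("value") * pl.col("weight")).alias("weighted"))
    .group_by("station")
    .agg(pl.col("weighted").mean())
    .collect()  # optimization and execution happen once, here
)
```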

I do wish it were just as easy to use as a Rust library though; the focus, however, seems to be mainly on being competitive in Python land.

tomtom1337 · 9h ago
Out of curiosity, what makes a rust library easier to use? Could you expand on that?
ritchie46 · 8h ago
He means that he wants our Rust library to be as easy to use as our Python lib. Which I understand, as our focus has been mostly on Python.

It is where most of our userbase is, and it is very hard for us to have a stable Rust API, as we have a lot of internal moving parts which Rust users typically want access to (as they like to be closer to the metal) but which have no stability guarantees from us.

In python, we are able to abstract and provide a stable API.

lucasyvas · 8h ago
I understand the user pool comment but don’t understand why you wouldn’t be able to have a rust layer that’s the same as the Python one API-wise.

I say this as a user of neither - just that I don’t see any inherent validity to that statement.

If you are saying Rust consumers want something lower level than you’re willing to make stable, just give them a higher level one and tell them to be happy with it because it matches your design philosophy.

bobbylarrybobby · 7h ago
The issue with Rust is that, as a strict language with no function overloading (except via traits) or keyword arguments, things get very verbose. For instance, in Python you can treat a string as a list of columns, as in `df.select('date')`, whereas in Rust you need to write `df.select([col('date')])`. Say you want to map a function over three columns; it's going to look something like this:

```
df.with_column(
    map_multiple(
        |columns| {
            let col1 = columns[0].i32()?;
            let col2 = columns[1].str()?;
            let col3 = columns[2].f64()?;
            col1.into_iter()
                .zip(col2)
                .zip(col3)
                .map(|((x1, x2), x3)| {
                    let (x1, x2, x3) = (x1?, x2?, x3?);
                    Some(func(x1, x2, x3))
                })
                .collect::<StringChunked>()
                .into_column()
        },
        [col("a"), col("b"), col("c")],
        GetOutput::from_type(DataType::String),
    )
    .alias("new_col"),
);
```

Not much Polars can do about that in Rust; that's just what the language requires. But in Python it would look something like:

```
df.with_columns(
    pl.struct("a", "b", "c")
    .map_elements(
        lambda row: func(row["a"], row["b"], row["c"]),
        return_dtype=pl.String,
    )
    .alias("new_col")
)
```

Obviously the performance is nowhere close to comparable because you're calling a python function for each row, but this should give a sense of how much cleaner Python tends to be.

quodlibetor · 3h ago
> Not much polars can do about that in Rust

I'm ignorant about the exact situation in Polars, but it seems like this is the same problem that web frameworks have to handle to enable registering arbitrary functions, and they generally do it with a FromRequest trait and macros that implement it for functions of up to N arguments. I'm curious whether there were attempts that failed at something like FromDataframe to enable at least |c: Col<i32>("a"), c2: Col<f64>("b")| {...}

https://github.com/tokio-rs/axum/blob/86868de80e0b3716d9ef39...

https://github.com/tokio-rs/axum/blob/86868de80e0b3716d9ef39...

bobbylarrybobby · 2h ago
You'd still have problems.

1. There are no variadic functions so you need to take a tuple: `|(Col<i32>("a"), Col<f64>("b"))|`

2. Turbofish! `|(Col::<i32>("a"), Col::<f64>("b"))|`. This is already getting quite verbose.

3. This needs to be general over all expressions (such as `col("a").str.to_lowercase()`, `col("b") * 2`, etc), so while you could pass a type such as Col if it were IntoExpr, its conversion into an expression would immediately drop the generic type information because Expr doesn't store that (at least not in a generic parameter; the type of the underlying series is always discovered at runtime). So you can't really skip those `.i32()?` calls.

Polars definitely made the right choice here — if Expr had a generic parameter, then you couldn't store Expr of different output types in arrays because they wouldn't all have the same type. You'd have to use tuples, which would lead to abysmal ergonomics compared to a Vec (can't append or remove without a macro; need a macro to implement functions for tuples up to length N for some gargantuan N). In addition to the ergonomics, Rust’s monomorphization would make compile times absolutely explode if every combination of input Exprs’ dtypes required compiling a separate version of each function, such as `with_columns()`, which currently is only compiled separately for different container types.

The reason web frameworks can do this is because of `$( $ty: FromRequestParts<S> + Send, )*`. All of the tuple elements share the generic parameter `S`, which would not be the case in Polars — or, if it were, would make `map` too limited to be useful.

tomtom1337 · 7h ago
Ah, of course. Slightly ambiguous English tricked me there. Thank you Ritchie!
sureglymop · 5h ago
I apologize for that, English isn't my first language. Glad it was explained so well!
robertkoss · 13h ago
dvko · 12h ago
Never forget! Crazy to see how far it's come. And how lackluster the initial reception on HN was back then.
qrush · 30m ago
I thought this was about my favorite sparkling water brand at first glance.
robertkoss · 13h ago
Love it!

Still don't get why one of the biggest players in the space, Databricks, is overinvesting in Spark. For startups, Polars or DuckDB are completely sufficient. Other companies like Palantir already support bring-your-own-compute.

mr_toad · 9h ago
Databricks is targeting large enterprises, who have a variety of users. Having both Python and SQL as first class languages is a selling point.
whyever · 12h ago
That's a good question! Especially after Frank McSherry's COST paper [1], it's hard to imagine where the sweet spot for Spark is. I guess for Databricks it makes sense to push Spark, since they are the ones who created it. In a way, it's their competitive advantage.

[1]: https://www.usenix.org/system/files/conference/hotos15/hotos...

cantdutchthis · 13h ago
Been a polars fan for a loooong time. Happy to see the team ship their product and I hope it does well!
lvl155 · 12h ago
Polars is certainly better than pandas for doing things locally. But that is a low bar. I've not had a great experience using Polars on large enough datasets. I almost always end up using DuckDB. If I am using SQL at the end of the day, why bother starting with Polars? With AI these days, it's ridiculously fast to put together performant SQL. Heck, you can even make your own grammar and be done with it.
sirfz · 10h ago
SQL is definitely easier and faster to compose than any dataframe syntax, but I think pandas syntax (via the slicing API) is faster to type and in most cases more intuitive. Still, I use Polars for all df-related tasks in my workflow, since it's more structured and composable (although it takes more time to construct, a cost I'm willing to accept when not simply prototyping). When in an IPython session, SQL via DuckDB is king. Also: `python -m chdb "describe 'file.parquet'"` (or any query) is wonderful.
mr_toad · 9h ago
> SQL is definitely easier and faster to compose

Sometimes. But sometimes Python is just much easier. For example transposing rows and columns.

infecto · 10h ago
I guess if it's too large to be performant, then SQL can be the way to go. I avoid SQL for one-off tasks, though, as I can more easily grok transformations in Polars code than in SQL queries.
dkdcio · 10h ago
You can use Ibis if you want a dataframe API on top of DuckDB (or a number of other query engines, including Polars).
boomer_joe · 11h ago
I don't understand. Can I use distributed Polars with my own machines, or do I have to buy cloud compute to run distributed queries (I don't want that)? If not, is this planned?
ritchie46 · 11h ago
On-premises is in the works. We expect this in a couple of months. Currently it is managed on AWS only.
boomer_joe · 11h ago
Thanks! Will it be paid or open source?
rubenvanwyk · 11h ago
Paid
jpcompartir · 12h ago
Polars is great, absolute best of luck with the launch
willvarfar · 13h ago
Hmm, so how does the Polars SQLContext stack up against DuckDB? And can both cope with distributed Polars?

It feels like we are on the path to reinventing BigQuery.
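
For context, a minimal sketch of the OSS Polars SQLContext being asked about (the file name is hypothetical): it runs SQL against registered LazyFrames.

```python
import polars as pl

lf = pl.scan_parquet("trips.parquet")
ctx = pl.SQLContext(trips=lf)  # register the LazyFrame under the name "trips"
out = ctx.execute("SELECT count(*) AS n FROM trips", eager=True)
```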

ritchie46 · 13h ago
Hi, I am the original author and CEO of Polars. We are not focused on SQL at this time and provide a DataFrame native API.

Polars Cloud will, for the moment, only support our DataFrame API. SQL might come later on the roadmap, but since this market is very saturated, we don't feel there is much need there.

weinzierl · 13h ago
Out of curiosity and because I don't want to create a test account right now:

How does billing with "Deploy on AWS" work? Do I need to bring my own AWS account, with Polars paid for the image through AWS, or am I billed by Polars and they pass a share on to AWS? In other words, do I have a contract primarily with AWS or with Polars?

ritchie46 · 13h ago
Your billing partner is AWS. Polars' markup is on your AWS bill.
gigatexal · 13h ago
Cool. But abstract away the infra knowledge, like the actual instance types. Instead, I'd expect the Polars Cloud abstraction to find me the most cost-effective (spot) instance that meets my CPU, memory, and disk reqs. Why do I have to give it — looking at the example — the AWS instance type?
ritchie46 · 13h ago
You don't have to. Passing CPU and memory works as well:

    pc.ComputeContext(
        cpus=4,
        memory=16,
    )
We are working on a minimal cluster and auto-scaling based on the query.
gigatexal · 12h ago
Nice!

Ritchie, curious: you mentioned in other responses that the SQL context stuff is out of scope for now. But I thought the SQL things were basically syntactic sugar over the dataframes; in other words, they both "compile" down to the same thing. If true, then being able to run arbitrary SQL queries should be doable out of the box?

ritchie46 · 12h ago
Not right now. Our current SQLContext locally inspects schemas to convert the SQL to Polars LazyFrames (DSL).

However, this should happen during IR resolving. That is, the SQL should translate directly to Polars IR, not to LazyFrames. That way we can inspect/resolve all schemas server-side.

It requires a rewrite of our SQL translation in OSS. This should not be too hard, but it is quite some work. Work we will eventually get to.

gigatexal · 12h ago
Thanks for the context.
raoulj · 9h ago
Is there any distributed Polars outside of Polars Cloud?

EDIT: never mind, I see the same question elsewhere in this thread. The answer is no!

ritzaco · 12h ago
Maybe just me, but for anyone else who was confused

- Polars (Pola.rs) - the DataFrames library that now has a cloud version

- Polar (Polar.sh) - Payments and MoR service built on top of Stripe

forks · 10h ago
- Polar, the authorization DSL created by Oso

It's a common name

blackhaz · 9h ago
How does Polars compare to FireDucks?
nivekney · 8h ago
cmollis · 9h ago
Can I run a distributed computation in Polars Cloud on my own AWS infra? Or do I need to run it on-prem?
dbacar · 3h ago
Snowflake, Polars, DuckDB, Firebase, FireDucks... I guess the next product will be IceDuck.

What is wrong with you DB people :))).

anonu · 13h ago
So competing with Snowflake?
rorads · 13h ago
EDIT: I think the below is correct, but I've just seen on the main product landing page that for a certain benchmark it's an order of magnitude cheaper AND faster than AWS Glue, so that's the target market by the looks of things.

——

I don't think so - probably more in the realm of Spark and, based on the roadmap, Airflow.

For me it would be about doing big data analytics / dashboarding / ML or DS data prep.

My understanding is that Snowflake plays a lot in the data warehouse/lakehouse space, so is more central to data ops / cataloguing / SSOT type work.

But hey that’s all first impressions from the press release.

dkdcio · 10h ago
More so competing with Coiled Computing (the Dask version; very similar, and you can run Polars there too), and then Databricks more than Snowflake, but all of these data platforms converge on similar features. Also competing with Fivetran eventually, after their acquisition yesterday.
cbb330 · 13h ago
can you dive a bit deeper into the comparison with spark rdd
ritchie46 · 12h ago
I am not an expert on Spark RDDs, but AFAIK they are a lower-level data structure that offers resilience and a lower-level map-reduce API.

Polars Cloud maps the Polars API/DSL to distributed compute. This is more akin to Spark's high level DataFrame API.

With regard to implementation, we create stages that run parts of the Polars IR (internal representation) on our OSS streaming engine. Those stages run on one or many workers and create data that will be shuffled between stages. The scheduler is responsible for creating the distributed query plan and for work distribution.

ayhanfuat · 11h ago
Can you tell a little about the status of Iceberg write support? Partitioning, maintenance etc.
ritchie46 · 8h ago
We have full Iceberg read support. We have done some preliminary work on Iceberg write support. I think we will ship that once we have decided which catalog we will add. The Iceberg write API is intertwined with that.