Show HN: Xorq – open-source Python-first Pandas-style pipelines
We’d love your feedback and contributions. xorq is [Apache 2.0 licensed](https://github.com/letsql/xorq/blob/main/LICENSE) to encourage open collaboration.
Repo: https://github.com/letsql/xorq
Docs: https://docs.xorq.dev
Roadmap Issues: https://github.com/letsql/xorq
You can get started with `pip install xorq`.
Or, if you use nix, you can simply run `nix run github:xorq-labs/xorq` and drop into an IPython shell.
Demo video: https://youtu.be/jUk8vrR6bCw
Here are some vignettes to look into next:
1. MCP Server + Flight + XGBoost: https://docs.xorq.dev/vignettes/mcp_flight_server
2. 1 DuckDB + 2 Writers + 1 Reader: https://docs.xorq.dev/vignettes/duckdb_concurrent
3. OpenAI UDF: https://docs.xorq.dev/tutorials/hn_data_prep
Some features to note:
- Ibis-based multi-engine expression system: effortless engine-to-engine streaming
- Cache expressions with `.cache` operator
- Portable DataFusion-backed UDF engine with first class support for pandas dataframes
- Serialize Expressions to and from YAML
- Easily build Flight end-points by composing UDFs
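To make the caching and serialization bullets concrete, here is a minimal stdlib-only sketch of the general idea: a deferred expression is serialized to text, hashed, and the hash is used as a cache key that survives across sessions. This is not xorq's API. The expression layout, `serialize`, `evaluate`, and `cached_execute` are hypothetical names invented for illustration, and JSON stands in for YAML so the example needs no third-party packages.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Hypothetical, simplified deferred expression: nested dicts describing
# a computation that has not been executed yet.
expr = {
    "op": "filter",
    "predicate": {"column": "score", "gt": 10},
    "source": {"op": "table", "name": "stories"},
}

# Stand-in for an engine's tables.
TABLES = {"stories": [{"id": 1, "score": 5}, {"id": 2, "score": 42}]}


def serialize(expr):
    """Serialize the expression deterministically (JSON stands in for YAML)."""
    return json.dumps(expr, sort_keys=True)


def evaluate(expr, tables):
    """Execute the tiny expression tree against in-memory tables."""
    if expr["op"] == "table":
        return tables[expr["name"]]
    if expr["op"] == "filter":
        rows = evaluate(expr["source"], tables)
        pred = expr["predicate"]
        return [r for r in rows if r[pred["column"]] > pred["gt"]]
    raise ValueError(f"unknown op: {expr['op']}")


def cached_execute(expr, tables, cache_dir, stats):
    """Return a cached result if this exact expression was executed before."""
    key = hashlib.sha256(serialize(expr).encode()).hexdigest()
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        stats["hits"] += 1
        return json.loads(path.read_text())
    stats["misses"] += 1
    result = evaluate(expr, tables)
    path.write_text(json.dumps(result))  # on disk, so it outlives the session
    return result


cache_dir = tempfile.mkdtemp()
stats = {"hits": 0, "misses": 0}
first = cached_execute(expr, TABLES, cache_dir, stats)
second = cached_execute(expr, TABLES, cache_dir, stats)
```

The second call is served from disk without re-running the expression. Note that this sketch keys only on the expression, so it would not notice changes in the underlying data; a real cache needs some invalidation strategy for that.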
Thanks for checking this out, and we’re here to answer any questions!
I really tried to work Ibis into my projects, thinking I could stay with native Ibis functions until I had to go back to another tool like DuckDB or Polars, but I kept finding that Ibis couldn’t do some things. At that point it was either flip over to Polars to do x, or just use Polars.
We have ambitions of supporting alternative APIs like Narwhals in the future, though, which could map the Polars API to Ibis's internal representation.
We found Ibis to be super extensible. In xorq, we also support pandas UDFs, so if you know pandas you should be pretty well covered. UDFs are a pretty nice way to extend the API.
What sort of operations did you find Ibis was missing when you tried? What about it isn't extensible?
1. How does it compare against alternatives?
2. Do you have benchmarks?
- Ibis because, while it can target multiple engines (as we state in our docs, we are built on and heavily reliant on Ibis), it aims to be "single engine, single session" in its execution, in that nothing is expected to persist beyond the current session and an Ibis expression can only have a single engine. We want to be multi-engine and have some artifacts durable across sessions (by way of caching).
- Snowpark because it is sort of "multi-engine" by way of external functions or Python stages, but locked to Snowflake. In some sense, we want to be Anypark: Snowpark-like functionality, but centered on whichever engine is desired, with performant interop with any other engines.
2. We don't have anything I would hold out as benchmarks yet. We don't aim to be "best in class" or the "fastest engine"; we aim to be "in class" for as many operations as possible (we use the word performant). Our goal is to make it easy for an org to choose whichever engine(s) they feel most performant in when they consider the full space of {developer,computation} x {time,cost}. However, Hussain has demonstrated how having information from the "whole pipeline" available but execution deferred can allow for specialized optimization by way of predicate pushdowns (https://ibis-project.org/posts/udf-rewriting/).
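As a rough illustration of why deferred execution helps here, the sketch below pushes a filter ahead of a per-row UDF when the filter does not depend on the UDF's output. It is a toy, stdlib-only model and not the technique from the linked post or xorq's optimizer. The step format, `push_down_filters`, and `run` are hypothetical, and it assumes the UDF is row-wise and only adds a new column.

```python
def add_title_length(row):
    """Stand-in for an expensive per-row UDF."""
    return len(row["title"])


def push_down_filters(steps):
    """Move each filter ahead of a preceding UDF whose output it does not read.

    Steps are (kind, column, fn) tuples. For a "udf" step, column is the
    output column; for a "filter" step, it is the column being tested.
    """
    steps = list(steps)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(steps)):
            prev, cur = steps[i - 1], steps[i]
            if cur[0] == "filter" and prev[0] == "udf" and cur[1] != prev[1]:
                steps[i - 1], steps[i] = cur, prev
                changed = True
    return steps


def run(steps, rows, counter):
    """Execute the steps in order, counting how often a UDF is invoked."""
    for kind, column, fn in steps:
        if kind == "udf":
            out = []
            for row in rows:
                counter["udf_calls"] += 1
                out.append({**row, column: fn(row)})
            rows = out
        elif kind == "filter":
            rows = [r for r in rows if fn(r[column])]
    return rows


rows = [
    {"title": "a", "score": 3},
    {"title": "bb", "score": 50},
    {"title": "ccc", "score": 7},
    {"title": "dddd", "score": 120},
]
steps = [
    ("udf", "title_len", add_title_length),
    ("filter", "score", lambda s: s > 10),
]

naive_counter = {"udf_calls": 0}
naive = run(steps, rows, naive_counter)

pushed_counter = {"udf_calls": 0}
pushed = run(push_down_filters(steps), rows, pushed_counter)
```

Both plans return the same rows, but the rewritten plan calls the UDF only on rows that survive the filter. The rewrite is only possible because the whole pipeline is known before anything executes.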
Thanks for your interest, and please feel free to challenge any of the above or point us to anything you think we might have overlooked!
Best, Dan
We have previously demonstrated the capability of doing iterative batch training by way of our "batteries-included" engine. I'll try to post a reference later but need to run now due to family obligations.
Anecdotally, TPC-H 10 TB is pretty doable nowadays with DuckDB, so xorq goes as far as your engine may take you...