Show HN: Term – Rust-based data validation with OpenTelemetry

Hi HN, I'm Eric, a recovering data engineer. Most recently I worked on the data platforms at two YC-backed startups, Kable (YC W22) and Finch (YC S20).

Every data team I've worked with struggles with data quality validation. Current solutions like Apache Deequ require spinning up entire Spark clusters just to check if your data meets basic quality constraints.

When I found Apache DataFusion, it was love at first sight - Spark-like ergonomics without the JVM or the cluster overhead. That's what led me to build Term.

Term is a Rust library that provides Deequ-style data validation using Apache DataFusion. You can run comprehensive data quality checks anywhere - from your laptop to CI/CD pipelines - without any JVM or cluster setup. On a 1M-row dataset with 20 constraints, Term completes validation in 0.21 seconds (vs. 3.2 seconds unoptimized) by batching the constraint evaluations into just 2 table scans instead of 20.

The technical approach: Term leverages DataFusion's columnar processing engine to validate data in Arrow format efficiently. Validation rules compile directly to DataFusion physical plans, and Rust's zero-cost abstractions keep the overhead minimal. You get 100 MB/s single-core throughput, which often outperforms distributed solutions for datasets under 100 GB.
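
To make the "fewer scans" idea concrete, here's a plain DataFusion snippet (not Term's internal code - the table and column names are invented for illustration) showing how several constraint metrics can share a single pass over the data:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Hypothetical input file and columns, purely for illustration.
    ctx.register_csv("orders", "orders.csv", CsvReadOptions::new())
        .await?;

    // Three constraint metrics - completeness of `email`, distinct
    // `order_id` count, and the mean of `amount` - share one table scan
    // when expressed as a single aggregate query.
    let df = ctx
        .sql(
            "SELECT
                 COUNT(email) * 1.0 / COUNT(*) AS email_completeness,
                 COUNT(DISTINCT order_id)      AS distinct_order_ids,
                 AVG(amount)                   AS mean_amount
             FROM orders",
        )
        .await?;

    df.show().await?;
    Ok(())
}
```

Term does the equivalent batching for you behind a higher-level constraint API, so you declare checks individually and they still collapse into a handful of scans.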

Term supports all the validation patterns you'd expect - completeness checks, uniqueness validation, statistical analysis (mean, correlation, standard deviation), pattern matching, custom SQL expressions, and built-in OpenTelemetry integration for production observability. The entire setup takes less than 5 minutes - just `cargo add term-guard` and you're validating data.
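
Roughly, a check suite reads like the simplified sketch below (the type and method names here are illustrative shorthand rather than the exact API - the README has real usage):

```rust
// Simplified, illustrative sketch - the names below are not necessarily the
// exact term-guard API; see the repo README for real usage.
use datafusion::prelude::SessionContext;
use term_guard::{Check, ValidationSuite};

async fn validate_orders(ctx: &SessionContext) -> anyhow::Result<()> {
    let report = ValidationSuite::new("orders")
        .check(Check::is_complete("email"))                // no NULL emails
        .check(Check::is_unique("order_id"))               // no duplicate ids
        .check(Check::satisfies("amount >= 0", "non-negative amounts"))
        .run(ctx)
        .await?;

    assert!(report.passed());
    Ok(())
}
```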

GitHub: https://github.com/withterm/term

I built this because I was tired of seeing teams skip data validation entirely rather than deal with Spark infrastructure. With Term, you can add validation to any Rust data pipeline with minimal overhead and zero operational complexity.

Coming next: Python/Node.js bindings, streaming support, and database connectivity. I'm particularly excited about making this accessible beyond the Rust ecosystem.

I'd love feedback on:

- The validation API - does it cover your use cases?

- Performance on your real-world datasets

- What validation patterns you need that aren't supported yet

- Ideas for the Python/Node.js API design

Happy to dive into technical details about DataFusion integration, performance optimizations, or anything else!
