Show HN: Hyperparam: OSS Tools for Exploring Datasets Locally in the Browser

26 platypii 16 5/1/2025, 2:06:55 PM hyperparam.app ↗
For the last year I’ve been developing Hyperparam — a collection of small, fast, dependency-free open-source libraries designed for data scientists and ML engineers to actually look at their data.

- Hyparquet: Read any Parquet file in browser/node.js

- Icebird: Explore Iceberg tables without needing Spark/Presto

- HighTable: Virtual scrolling of millions of rows

- Hyparquet-Writer: Export Parquet easily from JS

- Hyllama: Read llama.cpp .gguf LLM metadata efficiently

CLI for viewing local files: npx hyperparam dataset.parquet

Example dataset on Hugging Face Space: https://huggingface.co/spaces/hyperparam/hyperparam?url=http...

No cloud uploads. No backend servers. A better way to build frontend data applications.

GitHub: https://github.com/hyparam Feedback and PRs welcome!

Comments (16)

abeppu · 1h ago
Though these tools might be interesting, I wish they had called this something else. This isn't at all related to the concept of hyperparameters which people commonly refer to as hyperparams. And in their copy, the only reference to hyperparameters seems to be misusing the term.

> This stems from an industry-wide realization that model performance is ultimately bounded by data quality, not just model architecture or hyperparameters.

Generally we think of model architecture + weights (parameters) as making up the model itself, and hyperparam(s|eters) are more relevant to how one arrives at those weights -- and for this reason are more relevant to the efficacy of training than the performance of the resultant model.

platypii · 1h ago
That's fair criticism... to be honest when I started the project it was more focused on hyperparameters, and it evolved into this javascript-for-ai mission. But now I just kind of liked the name.
wbradmoore · 2h ago
Why not WASM? Seems like something like duckdb-wasm or datafusion-wasm can do the same thing?
platypii · 2h ago
Duckdb and datafusion are super cool! But they are VERY large wasm blobs (30-40mb each). This is often larger than the data you’re trying to load. And they add complexity with serving and deploying wasm files.

Hyparquet is 10kb of pure js, so it's trivial to deploy on a modern webapp, and it wins hands down on the time-to-first-data metric.

abeppu · 1h ago
> Duckdb and datafusion are super cool! But they are VERY large wasm blobs (30-40mb each). This is often larger than the data you’re trying to load.

I don't know how to reconcile this with the emphasis in the page on interacting with datasets relevant to AI, which are commonly several orders of magnitude larger than this. What's an AI problem where the data involved has been less than 10s of mb? I think that only toy problems and datasets could plausibly be smaller (e.g. the training images for the classic MNIST dataset are 47MB, and the whole dataset is 55MB https://www.kaggle.com/datasets/hojjatk/mnist-dataset?select... ).

platypii · 1h ago
Yea except with parquet you don't need to load the entire file; the parquet metadata lets you do HTTP range requests for just the data you need.
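To make the range-request idea concrete, here is a minimal TypeScript sketch of the first step. It assumes the standard Parquet layout, where the last 8 bytes of the file are a 4-byte little-endian footer length followed by the magic bytes `PAR1`. It is not Hyparquet's actual code, and the helper names (`rangeHeader`, `tailRange`, `metadataRange`) are made up for illustration.

```typescript
// Illustrative sketch only: these helpers are hypothetical, not Hyparquet's API.
// Assumed layout of the end of a Parquet file:
//   [footer metadata][4-byte little-endian metadata length]["PAR1"]
const MAGIC = [0x50, 0x41, 0x52, 0x31] // ASCII "PAR1"

interface ByteRange {
  start: number // inclusive
  end: number // exclusive
}

/** Format an HTTP Range header value for a half-open byte range. */
function rangeHeader(range: ByteRange): string {
  return `bytes=${range.start}-${range.end - 1}`
}

/** Byte range of the 8-byte tail (metadata length + magic). */
function tailRange(fileSize: number): ByteRange {
  if (fileSize < 12) throw new Error('file too small to be parquet')
  return { start: fileSize - 8, end: fileSize }
}

/** Given the 8 tail bytes, return the byte range that holds the footer metadata. */
function metadataRange(tail: Uint8Array, fileSize: number): ByteRange {
  if (tail.length !== 8) throw new Error('expected exactly 8 tail bytes')
  for (let i = 0; i < 4; i++) {
    if (tail[4 + i] !== MAGIC[i]) throw new Error('missing PAR1 magic')
  }
  const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength)
  const metadataLength = view.getUint32(0, true) // true = little-endian
  const start = fileSize - 8 - metadataLength
  if (start < 4) throw new Error('invalid metadata length')
  return { start, end: fileSize - 8 }
}

// Usage: a 1,000,000-byte file whose tail says the footer is 10,000 bytes long.
const exampleTail = new Uint8Array([0x10, 0x27, 0x00, 0x00, 0x50, 0x41, 0x52, 0x31])
const firstRequest = rangeHeader(tailRange(1_000_000)) // "bytes=999992-999999"
const secondRequest = rangeHeader(metadataRange(exampleTail, 1_000_000)) // "bytes=989992-999991"
```

A reader built this way would request the tail first, then the footer, and only after parsing the footer (which records where each column chunk lives) request the byte ranges for the rows and columns it wants to show. The file size could come from, for example, the Content-Length of a HEAD request.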

For example this parquet is the entire English Wikipedia (400mb) but loads less than 4mb including html and all js to display the first rows:

https://hyperparam.app/files?key=https%3A%2F%2Fs3.hyperparam...

This way you can have huge AI datasets in cloud storage, and still have a nice interface for looking at your data.

In particular, a lot of modern AI datasets are huge walls of text (web scrapes, chains of thought, or agentic conversation histories), and most datasets on Hugging Face are in parquet. So you can look at your data much more quickly this way than with, say, Jupyter notebooks.

Here's the glaive reasoning dataset on the Hyperparam hugging face space:

https://huggingface.co/spaces/hyperparam/hyperparam?url=http...

xoofoog · 1m ago
Wow - that's super clever. How do you get away with loading part of the file? Which part do you load?
klntsky · 1h ago
That's a lot of names for a bunch of tools that do a single task each.

What I would really benefit from is a hypothetical LLM chat app that is focused on data migration or processing pipelines.

platypii · 1h ago
Funny you say that, because I built these tools because I wanted to build something very much like what you're describing!

I was trying to look at, filter, and transform large AI datasets, and I was frustrated with how bad the existing tools were for working with datasets with huge amounts of text (web scrapes, github dumps, reasoning tokens, agent chat logs). Jupyter notebook is woefully bad at helping you to look at your data.

So I wanted to build better browser tools for working with AI datasets. But to do that I had to build these tools (there was no working parquet implementation in JS when I started).

Anyway I'm still working on building an app for data processing that uses an LLM chat assistant to help a single user curate entire datasets singlehandedly. But for now I'm releasing these components to the community as open source. And having them "do a single task each" was very much intentional. Thanks for the comment!

cyrdax · 52m ago
Anyone benchmark this vs. duckdb-wasm?
platypii · 39m ago
I don’t have benchmarks specifically against duckdb. I’m sure native C++ will run faster than JavaScript.

But what's important is that with Hyperparam you can do it in the browser, where the bottleneck will always be the network, not the CPU.

dmosites · 2h ago
The iceberg reader sounds cool but how does it handle auth? Most iceberg tables are not publicly accessible.
platypii · 1h ago
It does support using S3 presigned requests, but it's admittedly a little awkward to ask a server for a presigned request before every fetch. It does still have the benefit that you can have a small, light server just handing out signed requests, while the user and their browser do the heavy lifting. This can save a lot on server scaling costs.
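The presigned-request pattern described above might look roughly like this on the browser side. This is a sketch under assumptions, not Icebird's API: `Presigner`, `RangeFetcher`, `signedRangeRequest`, and `readSignedRange` are hypothetical names, and the signing server is represented by an injected function rather than a real endpoint.

```typescript
// Hypothetical sketch: none of these names come from Icebird.

/** Asks a small signing server for a short-lived signed URL for one object key. */
type Presigner = (key: string) => Promise<string>

/** Performs the actual HTTP GET, e.g. a thin wrapper around fetch. */
type RangeFetcher = (url: string, headers: { Range: string }) => Promise<Uint8Array>

interface SignedRangeRequest {
  url: string
  headers: { Range: string }
}

/** Build a GET request for bytes [start, end) of an already-signed URL. */
function signedRangeRequest(url: string, start: number, end: number): SignedRangeRequest {
  if (start < 0 || end <= start) throw new Error('invalid byte range')
  return { url, headers: { Range: `bytes=${start}-${end - 1}` } }
}

/**
 * The server only signs; the browser then downloads the bytes
 * directly from object storage using the signed URL.
 */
async function readSignedRange(
  key: string,
  start: number,
  end: number,
  presign: Presigner,
  fetchRange: RangeFetcher,
): Promise<Uint8Array> {
  const url = await presign(key) // one round trip to the signing server
  const request = signedRangeRequest(url, start, end)
  return fetchRange(request.url, request.headers) // data never passes through the server
}
```

Injecting `presign` and `fetchRange` keeps the flow testable without a network and makes the split visible: the signing server sees only object keys, while the data transfer happens between the browser and storage.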

That being said, I wish there was a better auth story. Open to suggestions if anyone has ideas!

doppenhe · 2h ago
Very cool, does `npx hyperparam dataset.parquet` phone home?
platypii · 2h ago
Zero telemetry, fully local. It spawns `http-server` on port 2048 and opens your browser at `localhost`. Similar pattern as Jupyter Notebooks. Feel free to audit the code... the server is <200 LOC.
lorr1 · 1h ago
You’re right. Python’s the worst