Show HN: Hyperparam: OSS Tools for Exploring Datasets Locally in the Browser
29 points by platypii | 16 comments | 5/1/2025, 2:06:55 PM | hyperparam.app
For the last year I’ve been developing Hyperparam — a collection of small, fast, dependency-free open-source libraries designed for data scientists and ML engineers to actually look at their data.
- Hyparquet: Read any Parquet file in browser/node.js
- Icebird: Explore Iceberg tables without needing Spark/Presto
- HighTable: Virtual scrolling of millions of rows
- Hyparquet-Writer: Export Parquet easily from JS
- Hyllama: Read llama.cpp .gguf LLM metadata efficiently
CLI for viewing local files: npx hyperparam dataset.parquet
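If you want to try it from Node.js, here's a minimal sketch of reading a local Parquet file with Hyparquet. It follows the parquetRead/onComplete pattern from the hyparquet README; the file path and column names are placeholders:

    import { readFileSync } from 'node:fs'
    import { parquetRead } from 'hyparquet'

    // hyparquet reads from an ArrayBuffer, so slice the Node Buffer's backing store
    const buf = readFileSync('dataset.parquet') // placeholder path
    const file = buf.buffer.slice(buf.byteOffset, buf.byteOffset + buf.byteLength)

    await parquetRead({
      file,
      columns: ['id', 'text'], // optional projection (hypothetical column names)
      onComplete: rows => console.log(rows), // decoded rows, one array of values per row
    })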
Example dataset on Hugging Face Space: https://huggingface.co/spaces/hyperparam/hyperparam?url=http...
No cloud uploads. No backend servers. A better way to build frontend data applications.
GitHub: https://github.com/hyparam
Feedback and PRs welcome!
> This stems from an industry-wide realization that model performance is ultimately bounded by data quality, not just model architecture or hyperparameters.
Generally we think of model architecture + weights (parameters) as making up the model itself, while hyperparameters are more relevant to how one arrives at those weights -- and for that reason speak more to the efficacy of training than to the performance of the resultant model.
Hyparquet is 10kb of pure JS, so it's trivial to deploy in a modern web app, and it wins hands down on the time-to-first-data metric.
I don't know how to reconcile this with the page's emphasis on interacting with AI datasets, which are commonly several orders of magnitude larger than this. What's an AI problem where the data involved has been less than tens of MB? I think that only toy problems and datasets could plausibly be smaller (e.g. the training images for the classic MNIST dataset are 47MB, and the whole dataset is 55MB: https://www.kaggle.com/datasets/hojjatk/mnist-dataset?select... ).
For example, this Parquet file is the entire English Wikipedia (400MB), but it loads less than 4MB, including the HTML and all the JS, to display the first rows:
https://hyperparam.app/files?key=https%3A%2F%2Fs3.hyperparam...
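To make the time-to-first-data point concrete, here's a sketch of fetching just the first rows of a remote file. It assumes hyparquet's asyncBufferFromUrl helper, which serves reads via HTTP range requests, so only the footer and the pages for the requested rows cross the network (the URL below is a placeholder):

    import { asyncBufferFromUrl, parquetRead } from 'hyparquet'

    // An AsyncBuffer backed by HTTP range requests -- no full 400MB download
    const file = await asyncBufferFromUrl({ url: 'https://example.com/wikipedia.parquet' })

    await parquetRead({
      file,
      rowStart: 0,
      rowEnd: 10, // only the byte ranges needed for these rows get fetched
      onComplete: rows => console.table(rows),
    })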
This way you can have huge AI datasets in cloud storage, and still have a nice interface for looking at your data.
In particular, a lot of modern AI datasets are huge walls of text (web scrapes, chains of thought, or agentic conversation histories), and most datasets on Hugging Face are in Parquet. So you can look at your data much more quickly this way than in, say, Jupyter notebooks.
Here's the glaive reasoning dataset on the Hyperparam hugging face space:
https://huggingface.co/spaces/hyperparam/hyperparam?url=http...
What I would really benefit from is a hypothetical LLM chat app that is focused on data migration or processing pipelines.
I was trying to look at, filter, and transform large AI datasets, and I was frustrated with how bad the existing tools were for working with datasets containing huge amounts of text (web scrapes, GitHub dumps, reasoning tokens, agent chat logs). Jupyter notebooks are woefully bad at helping you look at your data.
So I wanted to build better browser tools for working with AI datasets. But to do that I first had to build these libraries (there was no working Parquet implementation in JS when I started).
Anyway, I'm still working on an app for data processing that uses an LLM chat assistant to let a single user curate entire datasets singlehandedly. But for now I'm releasing these components to the community as open source. And having them "do a single task each" was very much intentional. Thanks for the comment!
That being said, I wish there was a better auth story. Open to suggestions if anyone has ideas!
But what's important is that with Hyperparam you can do it in the browser, where the bottleneck will always be network-bound, not CPU-bound.
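As a rough illustration of why it's network-bound: a Parquet read starts by range-requesting just the footer metadata (typically a few KB out of a multi-GB file), then fetches only the column chunks it needs. A hedged sketch, assuming hyparquet's parquetMetadataAsync helper and its thrift-style field names (the URL is a placeholder):

    import { asyncBufferFromUrl, parquetMetadataAsync } from 'hyparquet'

    // Fetch only the footer: a couple of small range requests, not the whole file
    const file = await asyncBufferFromUrl({ url: 'https://example.com/big.parquet' })
    const metadata = await parquetMetadataAsync(file)
    console.log(`rows: ${metadata.num_rows}, row groups: ${metadata.row_groups.length}`)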