Show HN: I built a tensor library from scratch in C++/CUDA

62 nirw4nna 5 6/18/2025, 3:20:05 PM github.com ↗
Hi HN,

Over the past few months, I've been building `dsc`, a tensor library from scratch in C++/CUDA. My main focus has been on getting the basics right, prioritizing a clean API, simplicity, and clear observability for running small LLMs locally.

The key features are: - C++ core with CUDA support written from scratch. - A familiar, PyTorch-like Python API. - Runs real models: it's complete enough to load a model like Qwen from HuggingFace and run inference on both CUDA and CPU with a single line change[1]. - Simple, built-in observability for both Python and C++.

Next on the roadmap is adding BF16 support and then I'll be working on visualization for GPU workloads.

The project is still early and I would be incredibly grateful for any feedback, code reviews, or questions from the HN community!

GitHub Repo: https://github.com/nirw4nna/dsc

[1]: https://github.com/nirw4nna/dsc/blob/main/examples/models/qw...

Comments (5)

aklein · 1h ago
I noticed you interface with the native code via ctypes. I think cffi is generally preferred (eg, https://cffi.readthedocs.io/en/stable/overview.html#api-mode...). Although you'd have more flexibility if you build your own python extension module (eg using pybind), which will free you from a simple/strict ABI. Curious if this strict separation of C & Python was a deliberate design choice.
rrhjm53270 · 38m ago
Do you have any plan for the serialization and deserialization of your tensor and nn library?
kajecounterhack · 2h ago
Cool stuff! Is the goal of this project personal learning, inference performance, or something else?

Would be nice to see how inference speed stacks up against say llama.cpp

liuliu · 1h ago
Both uses cublas under the hood. So I think it is similar for prefilling (of course, this framework is too early and don't have FP16 / BF16 support for GEMM it seems). Hand-roll gemv is faster for token generation hence llama.cpp is better.
helltone · 3h ago
This is very cool. I'm wondering if some of the templates and switch statements would be nicer if there was an intermediate representation and a compiler-like architecture.

I'm also curious about how this compares to something like Jax.

Also curious about how this compares to zml.