Show HN: I built a tensor library from scratch in C++/CUDA

46 nirw4nna 4 6/18/2025, 3:20:05 PM github.com ↗
Hi HN,

Over the past few months, I've been building `dsc`, a tensor library from scratch in C++/CUDA. My main focus has been on getting the basics right, prioritizing a clean API, simplicity, and clear observability for running small LLMs locally.

The key features are:
- C++ core with CUDA support written from scratch.
- A familiar, PyTorch-like Python API.
- Runs real models: it's complete enough to load a model like Qwen from HuggingFace and run inference on both CUDA and CPU with a single line change[1].
- Simple, built-in observability for both Python and C++.

Next on the roadmap is adding BF16 support and then I'll be working on visualization for GPU workloads.

The project is still early and I would be incredibly grateful for any feedback, code reviews, or questions from the HN community!

GitHub Repo: https://github.com/nirw4nna/dsc

[1]: https://github.com/nirw4nna/dsc/blob/main/examples/models/qw...

Comments (4)

aklein · 5m ago
I noticed you interface with the native code via ctypes. I think cffi is generally preferred, although you'd have more flexibility if you built a Python extension module (e.g. using pybind), which would free you from being limited to a simple C ABI. Curious whether this strict separation of C and Python was a deliberate design choice.
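
For readers unfamiliar with the ctypes approach, here is a minimal sketch of what calling native code through a plain C ABI looks like. It does not use dsc's actual bindings: it calls `memcpy` from the C standard library, and the helper `copy_floats` is a hypothetical name chosen for illustration. It assumes a POSIX system, where `ctypes.CDLL(None)` gives access to symbols already loaded into the process.

```python
import ctypes

# Assumption: a POSIX system, where CDLL(None) exposes the symbols already
# loaded into the process, including the C standard library.
libc = ctypes.CDLL(None)

# ctypes cannot infer C signatures, so they are declared explicitly.
libc.memcpy.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t]
libc.memcpy.restype = ctypes.c_void_p


def copy_floats(values):
    """Copy a list of floats through a native memcpy call (illustrative helper)."""
    n = len(values)
    FloatArray = ctypes.c_float * n   # C type: float[n]
    src = FloatArray(*values)         # buffer filled from the Python list
    dst = FloatArray()                # zero-initialised destination buffer
    libc.memcpy(dst, src, ctypes.sizeof(src))
    return list(dst)


if __name__ == "__main__":
    print(copy_floats([1.0, 2.5, -3.0]))
```

With ctypes, everything that crosses the boundary has to be expressed as C types and raw buffers like the ones above, which is the "simple ABI" constraint the comment refers to.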
kajecounterhack · 55m ago
Cool stuff! Is the goal of this project personal learning, inference performance, or something else?

Would be nice to see how inference speed stacks up against say llama.cpp

liuliu · 22m ago
Both use cuBLAS under the hood, so I think performance is similar for prefill (of course, this framework is still early and doesn't seem to have FP16 / BF16 support for GEMM yet). A hand-rolled GEMV is faster for token generation, hence llama.cpp is better there.
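
To make the prefill versus token-generation distinction concrete, here is a rough back-of-the-envelope sketch in Python. Prefill multiplies a whole prompt by a weight matrix (GEMM), while generating one token multiplies a single vector by it (GEMV). The layer size (4096 x 4096), the 512-token prompt, and FP32 storage are illustrative assumptions, not measurements of dsc or llama.cpp, and `arithmetic_intensity` is a hypothetical helper.

```python
def arithmetic_intensity(tokens, k, n, bytes_per_elem=4):
    """FLOPs per byte moved for a (tokens x k) @ (k x n) product.

    Assumes each operand is read once and the output is written once,
    which ignores caching and tiling effects.
    """
    flops = 2 * tokens * k * n  # one multiply and one add per term
    elems = tokens * k + k * n + tokens * n  # input + weights + output
    return flops / (bytes_per_elem * elems)


# Assumed shapes: one 4096 x 4096 weight matrix, stored as FP32.
prefill = arithmetic_intensity(tokens=512, k=4096, n=4096)  # GEMM
decode = arithmetic_intensity(tokens=1, k=4096, n=4096)     # GEMV

if __name__ == "__main__":
    print(f"prefill (GEMM): {prefill:.1f} FLOPs/byte")
    print(f"decode  (GEMV): {decode:.2f} FLOPs/byte")
```

Under these assumptions the single-token case performs well under one FLOP per byte of memory traffic, so it is limited by memory bandwidth rather than compute. That is why a kernel tuned specifically for GEMV can matter for token generation even when prefill relies on the same cuBLAS GEMM.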
helltone · 2h ago
This is very cool. I'm wondering if some of the templates and switch statements would be nicer if there was an intermediate representation and a compiler-like architecture.

I'm also curious about how this compares to something like JAX.

Also curious about how this compares to zml.