Show HN: I built a tensor library from scratch in C++/CUDA
Over the past few months, I've been building `dsc`, a tensor library from scratch in C++/CUDA. My main focus has been on getting the basics right, prioritizing a clean API, simplicity, and clear observability for running small LLMs locally.
The key features are:

- C++ core with CUDA support, written from scratch.
- A familiar, PyTorch-like Python API.
- Runs real models: it's complete enough to load a model like Qwen from HuggingFace and run inference on both CUDA and CPU with a single line change[1] (see the sketch after this list).
- Simple, built-in observability for both Python and C++.
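To make the "single line change" claim concrete, here's a rough sketch of what that kind of PyTorch-like device switch could look like. The names below (`dsc.tensor`, the `device=` argument) are assumptions for illustration, not dsc's confirmed API; the linked Qwen example[1] shows the real usage.

```python
# Hypothetical sketch -- the dsc API shown here is assumed, not confirmed.
import dsc

# The single line you'd change to move between backends:
device = "cuda"  # or "cpu"

x = dsc.tensor([[1.0, 2.0], [3.0, 4.0]], device=device)
w = dsc.tensor([[0.5], [0.5]], device=device)

# PyTorch-like ops dispatch to the C++/CUDA core underneath.
y = x @ w
print(y)
```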
Next on the roadmap is adding BF16 support; after that, I'll be working on visualization for GPU workloads.
The project is still early and I would be incredibly grateful for any feedback, code reviews, or questions from the HN community!
GitHub Repo: https://github.com/nirw4nna/dsc
[1]: https://github.com/nirw4nna/dsc/blob/main/examples/models/qw...
https://news.ycombinator.com/item?id=31378277
Even if the number reported there is off, it's not far off: ctypes just calls out to libffi, which is known to be the slowest way to do FFI.
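For anyone who wants a quick sanity check on that overhead, timing a trivial libc call through ctypes against an equivalent pure-Python call shows the per-call FFI cost directly. A minimal sketch, assuming a Unix-like system where `find_library("c")` resolves libc:

```python
# Rough microbenchmark of per-call ctypes (libffi) overhead vs. a plain Python call.
import ctypes
import ctypes.util
import timeit

libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.labs.argtypes = [ctypes.c_long]
libc.labs.restype = ctypes.c_long

def py_abs(x):
    # Pure-Python baseline doing comparably trivial work.
    return x if x >= 0 else -x

n = 1_000_000
t_ctypes = timeit.timeit(lambda: libc.labs(-42), number=n)
t_python = timeit.timeit(lambda: py_abs(-42), number=n)

print(f"ctypes labs(): {t_ctypes / n * 1e9:6.1f} ns/call")
print(f"python abs   : {t_python / n * 1e9:6.1f} ns/call")
```

This only measures call overhead, not a real inference workload, but it gives a ballpark for how much libffi adds per call.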
Would be nice to see how inference speed stacks up against, say, llama.cpp.
I'm also curious about how this compares to something like JAX.
Also curious about how this compares to zml.
No negative or positive comment on its usability, though; I'm not an ML/neural-network simulation person.
Coming from a background of working with OS kernels and systems software, I don't mind the kind of explicit "C++ lite" style used by the OP. Left to my own devices, I usually write things that way. I would think twice if I were trying to design a large framework, but ... I try to avoid those.