Show HN: I built a tensor library from scratch in C++/CUDA

78 points by nirw4nna | 6/18/2025, 3:20:05 PM | github.com | 17 comments
Hi HN,

Over the past few months, I've been building `dsc`, a tensor library from scratch in C++/CUDA. My main focus has been on getting the basics right, prioritizing a clean API, simplicity, and clear observability for running small LLMs locally.

The key features are:

- C++ core with CUDA support, written from scratch.
- A familiar, PyTorch-like Python API.
- Runs real models: it's complete enough to load a model like Qwen from HuggingFace and run inference on both CUDA and CPU with a single line change[1].
- Simple, built-in observability for both Python and C++.
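To give a rough idea of what that single line change looks like, here's a hypothetical sketch of the PyTorch-style flow; the exact names below are illustrative assumptions, the real usage is in the Qwen example at [1]:

```python
# Hypothetical sketch of a PyTorch-like workflow with a one-line device switch.
# The dsc names used here are illustrative assumptions, not the actual API;
# see the Qwen example linked in [1] for real usage.
import dsc

device = "cuda"  # the single line to change: set to "cpu" to run on the CPU

x = dsc.ones((4, 1024), device=device)
w = dsc.ones((1024, 512), device=device)
y = x @ w        # ops dispatch to the CUDA or CPU backend transparently
print(y.shape)
```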

Next on the roadmap is adding BF16 support and then I'll be working on visualization for GPU workloads.

The project is still early and I would be incredibly grateful for any feedback, code reviews, or questions from the HN community!

GitHub Repo: https://github.com/nirw4nna/dsc

[1]: https://github.com/nirw4nna/dsc/blob/main/examples/models/qw...

Comments (17)

aklein · 4h ago
I noticed you interface with the native code via ctypes. I think cffi is generally preferred (e.g. https://cffi.readthedocs.io/en/stable/overview.html#api-mode...). Although you'd have more flexibility if you built your own Python extension module (e.g. using pybind), which would free you from a simple/strict ABI. Curious if this strict separation of C & Python was a deliberate design choice.
nirw4nna · 56m ago
Yes, when I designed the API I wanted to keep a clear distinction between Python and C. At some point I had two APIs: one in Python and the other in high-level C++, and both shared the same low-level C API. I find this design quite clean and easy to work with if multiple languages are involved. When I get to perf I plan to experiment a bit with nanobind (https://github.com/wjakob/nanobind) and see if there's a noticeable difference wrt ctypes.
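For context, keeping the boundary as a plain C ABI means the Python side only ever calls exported C functions loaded via ctypes. A rough sketch of the pattern (the library name and function signature here are illustrative, not the actual dsc headers):

```python
# Generic sketch of binding a plain C ABI through ctypes.
# "libdsc.so" and dsc_add_f32 are hypothetical names used for illustration only.
import ctypes

lib = ctypes.CDLL("./libdsc.so")
lib.dsc_add_f32.argtypes = [ctypes.c_float, ctypes.c_float]
lib.dsc_add_f32.restype = ctypes.c_float

# Every call crosses the Python -> libffi -> C boundary, which is where the
# per-call overhead discussed in the reply below comes from.
print(lib.dsc_add_f32(1.0, 2.0))
```

The same low-level C API can then back a high-level C++ wrapper or any other language binding without changes.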
almostgotcaught · 8m ago
The call overhead of using ctypes vs nanobind/pybind is enormous

https://news.ycombinator.com/item?id=31378277

Even if the number reported there is off, it's not far off, because ctypes just calls out to libffi, which is known to be the slowest way to do FFI.
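If you want a number for your own machine, a crude way to measure the per-call cost is to time a trivial libm function bound through ctypes against the interpreter's builtin equivalent:

```python
# Crude micro-benchmark of per-call FFI overhead: math.cos (builtin) vs the
# same libm symbol bound through ctypes. Absolute numbers vary by machine.
import ctypes
import ctypes.util
import math
import timeit

libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

n = 1_000_000
t_builtin = timeit.timeit(lambda: math.cos(0.5), number=n)
t_ctypes = timeit.timeit(lambda: libm.cos(0.5), number=n)
print(f"builtin: {t_builtin:.3f}s  ctypes: {t_ctypes:.3f}s")
```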

kajecounterhack · 4h ago
Cool stuff! Is the goal of this project personal learning, inference performance, or something else?

Would be nice to see how inference speed stacks up against say llama.cpp

nirw4nna · 36m ago
Thanks! To be honest, it started purely as a learning project. I was really inspired when llama.cpp first came out and tried to build something similar in pure C++ (https://github.com/nirw4nna/YAMI), mostly for fun and to practice low-level coding. The idea for DSC came when I realized how hard it was to port new models to that C++ engine, especially since I don't have a deep ML background. I wanted something that felt more like PyTorch, where I could experiment with new architectures easily.

As for llama.cpp, it's definitely faster! They have hand-optimized kernels for a whole bunch of architectures, models and data types. DSC is more of a general-purpose toolkit. I'm excited to work on performance later on, but for now, I'm focused on getting the API and core features right.
liuliu · 4h ago
Both use cuBLAS under the hood, so I think it's similar for prefilling (of course, this framework is still early and doesn't seem to have FP16 / BF16 support for GEMM). Hand-rolled GEMV is faster for token generation, hence llama.cpp is better there.
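For anyone less familiar with the distinction: prefill pushes the whole prompt through at once (matrix-matrix, GEMM), while each generated token only multiplies one activation vector against the weights (matrix-vector, GEMV). A rough NumPy illustration of the shapes involved:

```python
import numpy as np

hidden, prompt_len = 4096, 512
W = np.random.randn(hidden, hidden).astype(np.float32)

# Prefill: all prompt tokens at once -> GEMM, which cuBLAS handles well.
X_prefill = np.random.randn(prompt_len, hidden).astype(np.float32)
Y_prefill = X_prefill @ W          # (512, 4096) @ (4096, 4096)

# Decode: one new token per step -> GEMV, memory-bound, where hand-rolled
# kernels (as in llama.cpp) tend to beat a generic GEMM path.
x_step = np.random.randn(1, hidden).astype(np.float32)
y_step = x_step @ W                # (1, 4096) @ (4096, 4096)
```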
helltone · 6h ago
This is very cool. I'm wondering if some of the templates and switch statements would be nicer if there was an intermediate representation and a compiler-like architecture.

I'm also curious about how this compares to something like Jax.

Also curious about how this compares to zml.

nirw4nna · 27m ago
You are absolutely correct! I started working on a sort of compiler a while back but decided to get the basics down first. The templates and switches are not really the issue; the bigger cost is going back and forth between C & Python. This is an experiment I did a few months ago (https://x.com/nirw4nna/status/1904114563672354822). As you can see, there's a ~20% perf gain just from generating a single naive C++ kernel for softmax instead of calling 5 separate kernels.
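To make the "5 separate kernels" concrete: an eager softmax decomposes into a chain of ops, each of which is a separate kernel launch (and a separate Python -> C round trip), whereas a generated kernel can do the whole row in one pass. Roughly, in NumPy terms:

```python
import numpy as np

x = np.random.randn(8, 1024).astype(np.float32)

# Eager softmax: each line below corresponds to a separate kernel launch
# (max, subtract, exp, sum, divide), roughly the 5 calls mentioned above.
m = x.max(axis=-1, keepdims=True)
t = x - m
e = np.exp(t)
s = e.sum(axis=-1, keepdims=True)
y = e / s

# A fused / generated kernel would compute all of this per row in one loop,
# avoiding the intermediate buffers and repeated dispatch overhead.
```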
rrhjm53270 · 3h ago
Do you have any plans for serialization and deserialization in your tensor and nn library?
nirw4nna · 21m ago
Right now I can load tensors directly from a safetensors file or from a NumPy array, so I'm not planning to add my own custom format, but I do plan to support GGUF files.
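For reference, safetensors files map cleanly onto NumPy arrays, which is what makes this path convenient. A generic sketch using the safetensors package directly (the file path is a placeholder, and this isn't dsc-specific code):

```python
# Generic sketch: read every tensor in a safetensors file as a NumPy array.
# "model.safetensors" is a placeholder path.
from safetensors.numpy import load_file

tensors = load_file("model.safetensors")   # dict: name -> numpy.ndarray
for name, arr in tensors.items():
    print(name, arr.shape, arr.dtype)
```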
einpoklum · 57m ago
It's very C-like: heavy use of macros, prefixes instead of namespaces, raw pointers for arrays, etc. Technically you're compiling C++, but... not really.

No negative or positive comment on its usability though, I'm not an ML/Neural Network simulation person.

caned · 28m ago
I've found adherence to C++ conventions in low-level software to be a rather contentious issue, most recently when working in an ML compiler group. One set abhorred the use of macros, the other any kind of polymorphism or modern C++ feature.

Coming from a background of working with OS kernels and systems software, I don't mind the kind of explicit "C++ lite" style used by the OP. Left to my own devices, I usually write things that way. I would think twice if I was trying to design a large framework, but ... I try to avoid those.

amtk2 · 2h ago
Super n00b question: what kind of laptop do you need for a project like this? Is a Mac OK, or do you need a dedicated Linux laptop?
nirw4nna · 19m ago
I developed this on an HP Omen 15 with an i7-8750H, a GTX 1050 Ti and 32GB of RAM, with Linux Mint as my OS.
kadushka · 2h ago
Any laptop with an Nvidia card
amtk2 · 1h ago
Does a gaming laptop with Windows work? I've always used a Mac for development because the toolchain is so much easier; wondering if there's a difference between Windows and Linux for CUDA development.
Anerudhan · 42m ago
You can always use WSL